
DX’06

17th International Workshop on Principles of Diagnosis

Peñaranda de Duero, Burgos Spain

June 26-28, 2006

Edited by

Carlos Alonso González, Teresa Escobet, and Belarmino Pulido

Foreword

The International Workshop on Principles of Diagnosis is an annual event that started in 1989, rooted in the Artificial Intelligence community. Its focus is on theories, principles and computational techniques for diagnosis, monitoring, testing, reconfiguration and repair of complex systems. As in past editions, DX-06 brings together scientists and industrialists with diverse interests in diagnosis and with different backgrounds.

Many people and organisations have contributed to this workshop.

We would first like to thank the authors who have provided the primary material of the workshop, a set of papers of outstanding quality. We also thank the Programme Committee members for the time and effort devoted to reviewing the papers and for their help in selecting the contributions. Special thanks go to the invited speakers, César Barta from Iberespacio and Andrés Marcos from Deimos-Space, for agreeing to contribute to the workshop and for the time spent preparing their talks.

Prior to this DX edition, the second edition of the Summer School on “Fault Detection and Diagnosis of Complex Systems” took place. The main goal of this intensive seminar is to introduce PhD students and companies to different advanced techniques currently used for fault detection and diagnosis. We especially want to thank Gautam Biswas, Luca Console, and Louise Travé-Massuyès for their contribution to the success of the School.

This workshop would not have been possible without the support of the local organisation teams. In Valladolid, Aníbal Bregón, Isaac Moro, Oscar Prieto, and Arancha Simón have worked very hard on every task we asked for help with. In Burgos, Juanjo Rodríguez, Jesús Maudes, and Santiago Villalba have provided invaluable help related to the conference location. Alberto Calvete created the workshop web-site.

We would like to acknowledge the contribution from different local, regional and national institutions and organizations: Ayuntamiento de Peñaranda de Duero, Departamento de Informática from Universidad de Valladolid, Universidad de Burgos, Junta de Castilla y León, Asociación Española para la Inteligencia Artificial, Ministerio de Educación y Ciencia, Caja Círculo (Burgos) and Consejo Regulador Ribera de Duero.

We are also thankful to our institutions Universidad de Valladolid, Universidad de Burgos, and UPC for all the financial and infrastructure support for organising the workshop.

Finally, we want to welcome DX06 participants. We hope this workshop will provide each participant with interesting and stimulating presentations and discussions as well as enjoyable social events.

Carlos Alonso, Teresa Escobet and Belarmino Pulido
DX06 Co-Chairs

Workshop Organization

Programme Committee Co-Chairs

Carlos Alonso González (Universidad de Valladolid, Spain)
Teresa Escobet (Universitat Politècnica de Catalunya, Spain)
Belarmino Pulido (Universidad de Valladolid, Spain)

Programme Committee Members

Gautam Biswas (Vanderbilt University, USA)
Luca Console (University of Turin, Italy)
Marie-Odile Cordier (IRISA, France)
Philippe Dague (Paris South University, France)
Johan de Kleer (Xerox, USA)
Richard Dearden (University of Birmingham, U.K.)
Michael Hofbaur (Graz University of Technology, Austria)
Sriram Narasiman (QSS, NASA Ames Research Center)
Pieter Mosterman (The MathWorks, Inc.)
Mattias Nyberg (Scania CV, Södertälje, Sweden)
Xavier Olive (Alcatel Space, France)
Yannick Pencolé (LAAS-CNRS, Toulouse, France)
Claudia Picardi (University of Turin, Italy)
Gregory Provan (Cork College, Ireland)
Martin Sachenbacher (University of Munich, Germany)
Marcel Staroswiecki (University of Lille, France)
Peter Struss (Technical University of Munich, Germany)
Markus Stumptner (University of South Australia, Adelaide, Australia)
Daniele Theseider-Dupre (University of Piemonte Orientale, Italy)
Louise Travé-Massuyès (LAAS-CNRS, Toulouse, France)
Brian Williams (MIT, USA)
Franz Wotawa (Graz University of Technology, Austria)
Marina Zanella (University of Brescia, Italy)

Reviewers

Arantza Aldea (Oxford Brookes University, UK)
Matthew Daigle (Vanderbilt University, USA)
Michael Esser (Technische Universität München, Germany)
Maria J. de la Fuente (Universidad de Valladolid, Spain)
Arjan van Gemund (Delft University of Technology, Netherlands)
Xenofon Koutsoukos (Vanderbilt University, USA)
Wolfgang Mayer (University of South Australia, Australia)
Bernhard Peischl (Technische Universität Graz, Austria)
Xavier Pucel (LAAS-CNRS, Toulouse, France)
Indranil Roychoudhury (Vanderbilt University, USA)
Juan J. Rodríguez Diez (Universidad de Burgos, Spain)
Marcos Da Silveira (LAAS-CNRS, Toulouse, France)

Local Organization Committee

Pulido Junquera, Belarmino (Universidad de Valladolid, Spain)
Rodríguez Diez, Juan José (Universidad de Burgos, Spain)
Alonso González, Carlos (Universidad de Valladolid, Spain)
Bregón Bregón, Aníbal (Universidad de Valladolid, Spain)
Maudes Raedo, Jesus (Universidad de Burgos, Spain)
Moro Sancho, Q. Isaac (Universidad de Valladolid, Spain)
Prieto Izquierdo, Oscar (Universidad de Valladolid, Spain)
Simón Hurtado, Arancha (Universidad de Valladolid, Spain)
Villalba Bartolomé, Santiago (Universidad de Burgos, Spain)

Table Of Contents

Invited Talks

Automatic Diagnosis in the Space Technology Field. European Programs
César Barta.

Robust FDI Estimation in Aerospace Applications
Andrés Marcos.

Papers

Hybrid Systems Diagnosability by abstracting Faulty Continuous Dynamics
Mehdi Bayoudh, Louise Travé-Massuyès, and Xavier Olive.

Distributed Diagnosis by using a Condensed Local Representation of the Global Diagnoses with Minimal Cardinality
Jonas Biteus, Erik Frisk, and Mattias Nyberg.

Ambiguity Groups Determination for Analog Non-Linear Circuits Diagnosis
Barbara Cannas, Alessandra Fanni, and Augusto Montisci.

Exploiting independence in a decentralised and incremental approach of diagnosis
Marie-Odile Cordier, and Alban Grastien.

Multiple Fault Diagnosis in Complex Physical Systems
Matthew Daigle, Xenofon Koutsoukos, and Gautam Biswas.

Improvement of Chronicle-based Monitoring using Temporal Focalization and Hierarchization
Christophe Dousson, and Pierre Le Maigat.

Model-based Test Generation for Embedded Software
Michael Esser, and Peter Struss.

A Multi-Valued SAT-Based Algorithm for Faster Model-Based Diagnosis
Alexander Feldman, Jurryt Pietersma, and Arjan van Gemund.

A general method for diagnosing axioms
Gerhard Friedrich, Stefan Rass, and Kostyantyn Shchekotykhin.

Robust Fault Detection with State Estimators and Interval Models Using Zonotopes
Pedro Guerra, Vicenç Puig, and Ari Ingimundarson.

Supervision Patterns in Discrete Event Systems Diagnosis
Thierry Jéron, Hervé Marchand, Sophie Pinchinat, and Marie-Odile Cordier.

Primary and Secondary Plan Diagnosis
Fenke de Jonge, Nico Roos, and Cees Witteveen.

Getting the Probabilities Right for Measurement Selection
Johan de Kleer.

Runtime Fault Detection and Localization in Component-oriented Software Systems
Bernhard Peischl, Joerg Weber, and Franz Wotawa.

A bayesian approach to fault isolation with application to diesel engine diagnosis
Anna Pernestål, Mattias Nyberg, and Bo Wahlberg.

Automatic Generation of Benchmark Diagnosis Models
Gregory Provan.

A Bayesian Approach to Efficient Diagnosis of Incipient Faults
Indranil Roychoudhury, Gautam Biswas, and Xenofon Koutsoukos.

Qualitative Domain Abstractions for Time-Varying Systems: an Approach based on Reusable Abstraction Fragments
Gianluca Torta, and Pietro Torasso.

Reliability and Diagnostics of Modular Systems: a New Probabilistic Approach
Michael Wachter, Rolf Haenni, and Jacek Jonczy.

Posters

Towards an Entropic Approach for the Analysis of Chronicle Models
Nabil Benayadi, Marc Le Goc, and Philippe Bouché.

Focusing fault localization in model-based diagnosis with case-based reasoning
Anibal Bregon, Belarmino Pulido, M. Aranzazu Simon, Isaac Moro, Oscar Prieto, Juan J. Rodriguez, and Carlos Alonso.

A Framework for Decentralized Qualitative Model-based Diagnosis
Luca Console, Claudia Picardi, and Daniele Theseider Dupré.

Comparing Diagnosability in Continuous and Discrete-Events Systems
Marie-Odile Cordier, Louise Travé-Massuyès, and Xavier Pucel.

On-line diagnosis for Time Petri Nets
G. Jiroveanu, G. B. De Schutter, and R. K. Boel.

Incremental indexing of temporal observations in diagnosis of active systems
Gianfranco Lamperti, and Marina Zanella.

Introducing Data Reduction Techniques into Reason Maintenance
Rüdiger Lunde.

A Supervision Architecture to Deal with Disruptive Events in UAV Missions
Rachid El Mafkouk, Jean-François Gabard, and Catherine Tessier.

Debugging Failures in Web Services Coordination
Wolfgang Mayer, and Markus Stumptner.

Observer Gain Effect in Linear Interval Observer-based Fault Isolation
Jordi Meseguer, Vicenç Puig, Teresa Escobet, and Joseba Quevedo.

A Generalization of the GDE Minimal-Hitting Set Algorithm to Handle Behavioral Modes
Mattias Nyberg.

Abstract Dependence Models in Software Debugging
Bernhard Peischl, Saffeeullah Soomro, and Franz Wotawa.

Robust Fault Detection using Set-membership Estimation and Constraints Satisfaction
Vicenç Puig, Carlos Ocampo-Martínez, Sebastián Tornil, and Ari Ingimundarson.

Hierarchical Modelling and Diagnosis for Embedded Systems
Hervé Ressencourt, Louise Travé-Massuyès, and Jérôme Thomas.

Intermittent Fault Detection through Message Exchanges: a Coherence Based Approach
Siegfried Soldani, Michel Combacau, Jerôme Thomas, and Audine Subias.

Distributed Trace Estimation with Asynchronous Local Clocks and Imperfect Observation Channels
Rong Su, and Michel Chaudron.

Invited Talks

Automatic Diagnosis in the Space Technology Field. European Programs

César Barta, Iberespacio, Tecnología Aeroespacial, C/ Magallanes 1, 1ª Planta

Madrid, Spain

A Next Generation European Reusable Launcher (RLV) is under evaluation. The evolution of the current Ariane 5 ECA as an Expendable (ELV) version is also possible. In any case, the reference missions will include a mission to geostationary orbit covering the market of commercial heavy telecommunication satellites (4 to 6 tons in GEO). Crew transportation shall not be considered as a design criterion. The program development schedule foresees a Next Generation Launcher operational around 2020.

The ELV reliability is defined based on a single mission (roughly one hour). Even though the European rocket Ariane is one of the most reliable in the worldwide industry, its history shows several catastrophic mission failures. For Ariane 5, the reliability target figure is 14 × 10^-3. The RLV will have to perform several missions (≅100 missions), all with an increased level of reliability (10^-3). To achieve this reliability increase, a new approach allowing failure detection and recovery for vehicle preservation will have to be implemented.

In this field of spacecraft systems, health monitoring is the surveillance by means of sensors and signal processing units to allow a description of the system to detect and isolate operational anomalies. An effective Health Monitoring System accomplishes detection and identification of failure causes whereas an optimised Health Management System, HMS, guarantees the selection of appropriate actions to recover from faulty conditions. In this way, the on-board automatic diagnosis task will be accompanied by the flight control. The HMS autonomy level should be decided.

Health Management should be present during flight, on the ground, and during the maintenance periods between flights. It applies across the entire life cycle of the vehicle, beginning in the earliest phases of design. For the reference missions, an HMS has two primary functions: to increase safety and reliability and to decrease maintenance turn-around time and cost. To carry out all of these functions, the HMS will have two components: the on-board and the off-board HM subsystems. The on-board HM subsystem will support the ground mission operation and flight supervision.

The SSME is the only reusable rocket engine. Some diagnostic systems were made for it and a laboratory test-bed was built at NASA. Researchers demonstrated the successful real-time fault detection and isolation of a model-based reactive autonomous system. Deep Space One and X-37 IVHM are other NASA experiments.

Several approaches to the HMS concept have had limited success. At one extreme, the most usual implementation is based on rule-based algorithms that check critical levels of a set of variables. Practice shows that a significant number of false alarms are raised (an example is Ariane 5 L501). At the other extreme, knowledge implemented by expert systems often cannot manage the whole spectrum of possible failures, leading to blackouts in the decision loops (Deep Space 1 remote agent deadlock). To fulfil the above functions, several development lines may be investigated: structural and turbo-machinery health monitoring, knowledge-based, case-based, machine-learning or model-based approaches, advanced sensors, rule-based diagnostics, etc.

DX'06 - Peñaranda de Duero, Burgos (Spain) 3


Robust FDI Estimation in Aerospace Applications

Andres Marcos, Ph.D. Advanced Projects Division (Simulation and Control)

DEIMOS Space S.L. Ronda de Poniente 19

Edificio Fiteni VI, P2, 2 Tres Cantos, Madrid - 28760

SPAIN email: [email protected]

In this talk we present an overview of the current state of the art in model-based robust fault detection & isolation (FDI) for guidance-navigation-control (GNC) in the aerospace world. Robust FDI is a critical component in aerospace systems due to the autonomous characteristics of the systems (interplanetary missions with long communication delays, re-entry vehicle black-outs, satellite eclipses, etc.), the environment where they operate (atmospheric, exo-atmospheric and space) and the aggressive system dynamics (high rotational and translational components). The standard approach in aeronautics and aerospace applications has been to use hardware redundancy and voting schemes, but nowadays there is a trend to reduce this redundancy using advanced diagnostic techniques in conjunction with fault tolerant control (FTC) approaches. A summary of recent missions and projects where FDI/FTC has played a major role is presented, together with a discussion of the specific problems and requirements for this type of application and of the robust FDI techniques currently used.


Papers / Posters

Hybrid Systems Diagnosability by Abstracting Faulty Continuous Dynamics

Mehdi Bayoudh and Louise Travé-Massuyès
LAAS-CNRS
Toulouse, France
bayoudh, [email protected]

Xavier Olive
Alcatel Alenia Space
[email protected]

Abstract

On-line model based reconfiguration is generally used to improve the ability of a system to tolerate faults. Recovery after fault occurrence relies on allowing the system to proceed with its mission from a new known nominal state. In this paper, we consider on-line reconfiguration from a novel point of view, having in mind to use reconfiguration actions to disambiguate the tracked estimated system state, i.e. to produce a more precise diagnosis. The choice of the best suited reconfiguration action(s) must hence be guided by the diagnosability properties of the system. However, diagnosability conditions known for continuous systems (CS) on one hand and for discrete event systems (DES) on the other hand cannot be applied directly because of the hybrid nature of the systems that we consider. Our work proposes a framework for analyzing the diagnosability of a hybrid system that builds on recent results establishing the formal equivalence of diagnosability definitions for DES and CS. The approach relies on merging the fault signatures exhibited at the continuous level into the Mode Automaton that represents the discrete dynamics of the system, so that DES diagnosability analysis can be performed on the resulting Behavior Automaton and the corresponding diagnoser. When the state of the system is ambiguous, an analysis of the diagnoser allows us to point at reconfiguration actions that safely move the system into a mode reducing ambiguity.

1 Introduction

Embedded systems found in today's cars, aircraft and space vehicles are characterized by a mix of hardware and software components and limited instrumentation. They hence undergo complex hybrid dynamics that can only be partially observed, which makes their on-board monitoring and diagnosis tricky. They generally require stochastic and/or uncertain approaches, which provide a belief state or, in other words, an ambiguous diagnosis [Hofbaur and Williams, 2004] [Benazera et al., 2002] [Williams and Nayak, 1999]. In many cases, testing the system on line can be an interesting option to produce a more precise diagnosis. For instance, in the space domain, specific commands are often applied by the ground segment to get more information about the state of a faulty spacecraft.

This kind of testing, which we call active diagnosis, involves reconfiguring the system so that new symptoms are exhibited through the existing sensor instrumentation. The choice of the best suited reconfiguration action(s) must hence be guided by the diagnosability properties of the system. Diagnosability analysis proves to be a requisite for several other tasks such as instrumentation design, end-of-line testing, etc. and has received a lot of attention from the Model Based Diagnosis community in the last few years, both for the analysis of Continuous Systems (CS) and Discrete Event Systems (DES) [Struss and Dressler, 2003] [Console et al., 2000] [Trave-Massuyes et al., 2004] [Sampath et al., 1995] [Pencole, 2004].

However, diagnosability conditions known for continuous systems (CS) on one hand, and for discrete event systems (DES) on the other hand, cannot be applied directly when the system has hybrid dynamics. We rely on recent results establishing the formal equivalence of diagnosability definitions for DES and CS [Cordier et al., 2006] and propose to abstract the faulty continuous dynamics of a hybrid automaton to produce an enriched discrete automaton that accounts for fault models. Fault models are obtained from fault signatures exhibited from the continuous dynamics constraints that are interpreted in terms of events.

DES diagnosability analysis can be performed on the resulting Behavior Automaton and the corresponding diagnoser. When the state of the system is ambiguous, an analysis of the diagnoser allows us to point at reconfiguration actions that safely move the system into a mode reducing ambiguity.

The paper is organized as follows. Section 2 introduces the hybrid modeling framework used for tracking the states of the system. Section 3 provides an insight into fault signatures as defined for continuous systems and gives the intuitions guiding our contribution. Section 4 then introduces the main DES diagnosability notions. Section 5 presents the procedure for building the Behavior Automaton from the Mode Automaton and fault signatures exhibited at the continuous behavior level. Section 6 presents some criteria for hybrid systems diagnosability. Finally, section 7 illustrates our approach with a motivational example and shows how reconfiguration actions can increase diagnosability. Related work, perspectives for future work and a concluding discussion are provided in section 8.

2 Hybrid modeling framework

Embedded systems combine continuous dynamics with discrete events (which can be commanded or spontaneous). Hence, the hybrid formalism is appropriate for modeling such complex dynamic systems. As in [Benazera et al., 2002] [Benazera and Trave-Massuyes, 2003], a hybrid system is described by a hybrid automaton defined as a tuple $S = (X, Q, \Sigma, T, C, (x_0, q_0))$, where:

• $X$ is the set of continuous variables, which includes observable and non observable variables. Those variables are linked with constraints that vary from one mode to another.

• $Q$ is the set of system states. Each state $q_i \in Q$ represents a functional mode of the system.

• $\Sigma$ is the set of events. Events correspond to command value switches, spontaneous mode changes and fault events.

• $\Sigma_o \subseteq \Sigma$ is the set of observable events. Without loss of generality we assume that fault events are unobservable.

• $T$ is the transition function, $Q \times \Sigma \to Q$.

• $C$ is the set of constraints, which may be qualitative or quantitative. Associating a subset of constraints $C_i \subseteq C$ to functional mode $q_i$ allows one to describe the system behavior evolution in this mode.

• $(x_0, q_0)$ is the initial condition.

The discrete part of the hybrid automaton, given by $M = (Q, \Sigma, T, q_0)$, is a discrete automaton that describes the discrete dynamics of the system, i.e. the possible evolutions between operating modes in $Q$. We refer to this automaton as the Mode Automaton. Modes include nominal and fault modes as well as an unknown mode which stands for all the non anticipated faulty situations. The unknown mode has no specified underlying behavior and hence no associated constraints.
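As a rough illustration of the Mode Automaton $M = (Q, \Sigma, T, q_0)$, the following Python sketch stores $T$ as a dictionary from (mode, event) pairs to successor modes. The class name, the mode names (`q_nom`, `q_stuck`, `q_unknown`) and the event names are hypothetical and chosen only for the example; they do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class ModeAutomaton:
    """Discrete part M = (Q, Sigma, T, q0) of a hybrid automaton (illustrative)."""
    modes: set          # Q
    events: set         # Sigma
    transitions: dict   # T: (mode, event) -> mode
    initial: str        # q0

    def step(self, mode, event):
        """Return the successor mode, or None if the event is not enabled."""
        return self.transitions.get((mode, event))

    def run(self, word):
        """Follow a sequence of events from q0; return None if it is not feasible."""
        mode = self.initial
        for event in word:
            mode = self.step(mode, event)
            if mode is None:
                return None
        return mode


# Hypothetical three-mode system: one nominal, one fault and one unknown mode.
mode_automaton = ModeAutomaton(
    modes={"q_nom", "q_stuck", "q_unknown"},
    events={"cmd_reset", "f_stuck", "f_unknown"},
    transitions={
        ("q_nom", "f_stuck"): "q_stuck",
        ("q_nom", "f_unknown"): "q_unknown",
        ("q_stuck", "cmd_reset"): "q_nom",
    },
    initial="q_nom",
)

final_mode = mode_automaton.run(["f_stuck", "cmd_reset"])
```

A dictionary keyed by (mode, event) matches the functional form $Q \times \Sigma \to Q$ used in this section; a set of triples would be the natural choice for the relational form used in section 4.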

3 Fault Signatures

The constraints in each mode $q_i$ can be brought back to a set of consistency tests. Following the parity space approach, consistency tests take the form of analytical redundancy relations (ARRs, forming the set $S_{ARR_{q_i}}$) obtained by eliminating non observable variables [Cordier et al., 2004]. An ARR can be expressed as $r = 0$, where $r$ is called the residual of the ARR. The ARRs are constraints that only contain observable variables. They can be determined off-line and then be evaluated on-line with the incoming observations, allowing one to check the consistency of the observed against the predicted system’s behavior. They are satisfied if the observed behavior satisfies the model constraints; in this case the associated residuals are zero. In the opposite case, all or some of the residuals are non zero. The set of residuals hence results in a boolean fault indicator tuple. The expected boolean value pattern for a given fault provides the fault signature. In our hybrid framework, the set of ARRs associated with each functional system mode is generally different, although some ARRs may be shared. A fault hence manifests itself by the fact that a subset of residuals switches to a non zero value, whereas other residuals may switch from an undetermined value to zero.
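The elimination step can be illustrated on a toy model that is not taken from the paper: assume two sensors $y_1 = x$ and $y_2 = 2x + u$ with an unobserved state $x$ and a known input $u$. Eliminating $x$ gives the ARR $y_2 - 2 y_1 - u = 0$. The tolerance used to turn the residual into a boolean indicator is also an assumption of this sketch.

```python
def residual(y1, y2, u):
    """Residual r of the toy ARR  y2 - 2*y1 - u = 0  (x has been eliminated)."""
    return y2 - 2.0 * y1 - u


def indicator(r, tol=1e-9):
    """Boolean fault indicator: 0 if the ARR is satisfied, 1 otherwise."""
    return 0 if abs(r) <= tol else 1


x, u = 3.0, 1.0

# Nominal case: both sensors follow the model, so the residual is zero.
nominal = indicator(residual(x, 2.0 * x + u, u))

# Hypothetical fault: a bias of 0.5 on the second sensor violates the ARR.
biased = indicator(residual(x, 2.0 * x + u + 0.5, u))
```

Only observable quantities ($y_1$, $y_2$, $u$) appear in `residual`, which is what allows the relation to be evaluated on-line.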

Definition 1 Given a set $[r_1, ..., r_n]$ of $n$ residuals and a set $F = [F_1, F_2, ..., F_m]$ of $m$ faults, the signature of a fault $F_j$ is given by the binary vector $FS_j = [s_{1j}, ..., s_{nj}]^T$, where $s_{ij} = 1$ if some components affected by $F_j$ are involved in $ARR_i$, and $s_{ij} = 0$ otherwise.
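Definition 1 can be read as a simple set-intersection test. The sketch below assumes that each ARR is described by the set of components it involves and each fault by the set of components it affects; the component and fault names are hypothetical.

```python
def fault_signature(arr_components, fault_components):
    """Binary signature [s_1j, ..., s_nj] of one fault F_j.

    arr_components: list of sets, the components involved in ARR_1..ARR_n.
    fault_components: set of components affected by F_j.
    """
    return [1 if involved & fault_components else 0 for involved in arr_components]


# Hypothetical system with three ARRs and three faults.
arr_components = [
    {"pump", "sensor_a"},   # ARR_1
    {"valve", "sensor_b"},  # ARR_2
    {"pump", "valve"},      # ARR_3
]
faults = {
    "F1": {"pump"},
    "F2": {"valve"},
    "F3": {"sensor_b"},
}

signatures = {
    name: fault_signature(arr_components, components)
    for name, components in faults.items()
}
```

In this example the three signatures are pairwise different, so the three faults produce distinct residual patterns.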

Residuals and fault signatures provide abstracted information about the continuous dynamics of the system, which is sufficient for characterizing the system’s nominal or faulty state.

When a fault occurs, fault signatures can be interpreted in terms of events referring to the residuals switching values. Our goal is to take advantage of this event driven information to enrich the system’s Mode Automaton, abstracting the continuous dynamics of the hybrid automaton into an extended discrete automaton that we call the Behavior Automaton. The diagnosability of the hybrid system can thus be analyzed from the Behavior Automaton, by using discrete event systems criteria [Sampath et al., 1995].

4 DES Diagnosability Analysis

4.1 Diagnosability definition

Diagnosability is the property of a system and its observables, i.e. the set of all the possible observations, that guarantees that a set of anticipated fault situations can be assessed and distinguished. Diagnosability definitions have been provided independently for CS [Trave-Massuyes et al., 2004] [Struss and Dressler, 2003] [Frisk et al., 2003] and for DES [Sampath et al., 1995] [Roze and Cordier, 2002]. However, recent results have proved that the definitions on both sides are formally equivalent [Cordier et al., 2006]. We take advantage of this result and propose to interpret the CS fault signatures, which are the key diagnosability concept, in terms of an automaton that can be merged into the discrete dynamics model. In this way, the diagnosability problem for the hybrid system is brought back to the diagnosability problem for an extended DES system. Consequently, this section restricts the presentation to the DES diagnosability definition and analysis through the so-called diagnoser [Sampath et al., 1995].

A DES is modeled by a finite state machine $M = (Q, \Sigma, T, q_0)$ where $Q$ is the set of states, $\Sigma$ is the set of events, $T \subseteq (Q \times \Sigma \times Q)$ the transition function and $q_0$ the initial state, as already defined in section 2. The event set $\Sigma$ is partitioned as $\Sigma = \Sigma_{uo} \cup \Sigma_o$, where $\Sigma_{uo}$ is the unobservable event set and $\Sigma_o$ the observable event set. Observable events are system commands or events generated from the sensors. In our approach, these latter observable events are the residual value switches. We consider $\Sigma_F \subseteq \Sigma_{uo}$ as the set of fault events to be diagnosed. In the DES community, diagnosis consists in the deduction of unobservable fault events from the observable traces generated by the system.

Definition 2 A fault $F$ is diagnosable iff its occurrence is always followed by a finite observable sequence of events that allows us to diagnose $F$ with certainty. The system is said to be diagnosable iff all the anticipated faults are diagnosable. Formally, let $s_F t$ be a sequence of events (or trajectory) such that $s_F$ ends with the occurrence of $F$, and $t$ is a continuation of $s_F$. $F$ is diagnosable iff:

$\forall$ trajectory $s_F t$, $\exists$ an integer $n$: $length(t) \geq n \Rightarrow$ ($\forall$ trajectory $s$ such that $P_{\Sigma_o}(s) = P_{\Sigma_o}(s_F t)$, $F$ occurs in $s$) [Pencole, 2004],

where $P_{\Sigma_o}$ is the projection operator on the set of observable events.
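The projection operator $P_{\Sigma_o}$ simply erases unobservable events from a trajectory. A minimal sketch, with hypothetical event names (`a`, `b` observable, `f` an unobservable fault event):

```python
def project(trajectory, observable):
    """P_{Sigma_o}: keep only the observable events, preserving their order."""
    return [event for event in trajectory if event in observable]


observable = {"a", "b"}

with_fault = ["a", "f", "b"]   # a trajectory s_F t in which the fault f occurs
without_fault = ["a", "b"]     # a trajectory s in which f does not occur

# Both trajectories yield the same observation, so this observation alone
# does not allow f to be diagnosed with certainty.
same_observation = project(with_fault, observable) == project(without_fault, observable)
```

Definition 2 requires that, after enough further events, no fault-free trajectory shares the projection of the faulty one; the pair above shows what a violation looks like for one fixed length of $t$.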

4.2 The diagnoser

We assume that $M$ (defined in subsection 4.1) has no unobservable cycles (i.e. cycles containing unobservable events only). The set of fault events $\Sigma_F$ is partitioned into disjoint sets corresponding to different failure types $F_i$: $\Sigma_F = \Sigma_{F_1} \cup \Sigma_{F_2} \cup ... \cup \Sigma_{F_n}$ and $\Sigma_{F_i} \cap \Sigma_{F_j} = \emptyset$ for $i \neq j$. The aim of the diagnosis is to make inferences about past occurrences of failure types on the basis of the observed events. In order to solve this problem, the system model is directly converted into a diagnoser.

The diagnoser $Diag(M) = (Q_{Diag}, \Sigma_{Diag}, T_{Diag}, q_{0Diag})$ is a deterministic finite state machine built from the system model $M = (Q, \Sigma, T, q_0)$ [Sampath et al., 1995]:

• $q_{0Diag} = (q_0, \emptyset)$ is the initial state of the diagnoser.

• $\Sigma_{Diag} = \Sigma_o$ is the set of observable events of the system.

• $Q_{Diag}$ is the set of states of the diagnoser: $Q_{Diag} \subseteq 2^{Q \times 2^{\Sigma_F}}$ or $Q_{Diag} \subseteq P(Q \times P(\Sigma_F))$, where $P(E)$ denotes the power set of $E$. The states of the diagnoser provide the set of diagnosis candidates as a set of couples whose first element refers to the state of the original system and whose second element is a label providing the set of faults on the path leading to this state. For example, when the diagnoser is in the state $q_{Diag} = \{(q_1, \{\}), (q_2, \{F_1, F_3\})\}$, it means that the system $M$ is in one of the states $\{q_1, q_2\}$, as developed in Table 1¹.

• $T_{Diag}$ is the diagnoser transition function built by a recursive process that consists in computing all the reachable states from the diagnoser initial state and propagating the diagnosis information. For more details see [Sampath et al., 1995].
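The recursive construction of $T_{Diag}$ can be sketched as a breadth-first exploration. The code below is an illustrative reading of the description above, not the construction of [Sampath et al., 1995] verbatim: it assumes $T$ is given as a set of triples, that `fault_type` maps each fault event to its failure type, and that diagnoser states are frozen sets of (state, label) couples. The function name and the small example system are hypothetical.

```python
from collections import deque


def build_diagnoser(transitions, q0, observable, fault_type):
    """Return (initial state, set of states, transition dict) of a diagnoser sketch."""

    def successors(q):
        return [(e, q2) for (q1, e, q2) in transitions if q1 == q]

    def reach_then_observe(q, label):
        """Couples reached by unobservable events followed by one observable event."""
        results = set()
        seen = {(q, label)}
        queue = deque([(q, label)])
        while queue:
            state, lab = queue.popleft()
            for e, nxt in successors(state):
                if e in observable:
                    results.add((e, (nxt, lab)))
                else:
                    # Propagate the diagnosis information along fault events.
                    new_lab = lab | {fault_type[e]} if e in fault_type else lab
                    if (nxt, new_lab) not in seen:
                        seen.add((nxt, new_lab))
                        queue.append((nxt, new_lab))
        return results

    initial = frozenset({(q0, frozenset())})
    states = {initial}
    delta = {}
    queue = deque([initial])
    while queue:
        d = queue.popleft()
        by_event = {}
        for q, label in d:
            for o, couple in reach_then_observe(q, label):
                by_event.setdefault(o, set()).add(couple)
        for o, couples in by_event.items():
            target = frozenset(couples)
            delta[(d, o)] = target
            if target not in states:
                states.add(target)
                queue.append(target)
    return initial, states, delta


# Hypothetical system: 'a' and 'b' are observable, 'f1' is a fault event of type F1.
T = {("q1", "a", "q1"), ("q1", "f1", "q2"), ("q2", "a", "q2"), ("q2", "b", "q2")}
initial, states, delta = build_diagnoser(T, "q1", {"a", "b"}, {"f1": "F1"})
```

In this toy system, observing `a` leaves both the fault-free and the faulty hypothesis open, whereas observing `b` leaves only the couple labelled with F1.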

Definition 3 Given a diagnoser state $q_{Diag} \in Q_{Diag}$, this state is $F_i$-uncertain iff $F_i$ does not belong to all the labels of the state whereas $F_i$ belongs to at least one label of the state.
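Definition 3 translates directly into a check over the labels of a diagnoser state. The sketch assumes a diagnoser state is a collection of (state, label) couples in which each label is a set of failure types; the function name is hypothetical.

```python
def is_uncertain(diag_state, fault):
    """True iff `fault` is in at least one label of the state but not in all of them."""
    labels = [label for _, label in diag_state]
    in_some = any(fault in label for label in labels)
    in_all = all(fault in label for label in labels)
    return in_some and not in_all


# The example state used in section 4.2: {(q1, {}), (q2, {F1, F3})}.
q_diag = {("q1", frozenset()), ("q2", frozenset({"F1", "F3"}))}

f1_uncertain = is_uncertain(q_diag, "F1")
f2_uncertain = is_uncertain(q_diag, "F2")
```

A state whose labels all contain the fault is certain for that fault, and a state in which no label contains it is not uncertain either.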

Theorem 1 The system $M$ is not diagnosable iff the associated diagnoser $Diag(M)$ contains an uncertain cycle, i.e. a cycle in which there is at least one $F_i$-uncertain diagnoser state for some $F_i$ and whose states also define a cycle in the original system $M$ [Sampath et al., 1995].

System mode    Diagnosis         Comments
$q_1$          $\emptyset$       the system may be in the nominal state $q_1$ (no faults)
$q_2$          $\{F_1, F_3\}$    the system may be in the faulty state $q_2$ with a diagnosis $\{F_1, F_3\}$ (the label $F_i$ means that at least one fault of type $\Sigma_{F_i}$ has occurred)

Table 1: The $\{F_1, F_3\}$-uncertain state of the diagnoser

¹ We do not use the ambiguous label used in [Sampath et al., 1995], but we explicitly give the set of faulty system modes.

5 Building the Behavior Automaton

At the continuous level, a fault manifests itself as anticipated in the fault signature, which reduces the detection task to detecting the violation of a subset of ARRs. In this paper, we propose to model the (nominal and faulty) continuous behavior of the hybrid system based on events referring to the set of ARRs associated to the different modes. For each mode of the system (nominal and faulty), we associate a so-called M-Behavior Automaton constructed from the knowledge of the residuals that must switch value when transitioning to this mode: these residuals include a subset of the residuals associated to the departure modes that switch to a non zero value and the residuals in the current mode that must switch to zero. Notice that the same procedure applies equally to transitions triggered by command events, fault events or spontaneous events. Every M-Behavior Automaton state hence evolves with the occurrence of residual value switches that define a set of events. The M-Behavior Automata capture all the information necessary to determine, by the analysis of their observable trajectories, the unobservable events (fault or spontaneous) that occurred at the transition between modes.

The system’s Behavior Automaton is obtained as an extension of the Mode Automaton by the M-Behavior Automata for each mode.

Let $S_{ARR_{q_i}} = \{ARR_{i1}, ARR_{i2}, ..., ARR_{iN_{ARR}(q_i)}\}$ be the set of ARRs associated to mode $q_i$ and $S_{r_{q_i}} = \{r_{i1}, r_{i2}, ..., r_{iN_{ARR}(q_i)}\}$ the associated set of residuals, where $N_{ARR}(q_i)$ is the number of ARRs in mode $q_i$. We denote by $S_{r_{system}} = \bigcup_i S_{r_{q_i}}$ the set of all residuals for the system and by $D = \{0, 1, und\}$ the residual value domain, where $und$ stands for the undefined value that is used to represent the case when the associated ARR is not defined in a given mode.

Now, let us define the function $e$, which associates an event to every residual value switch:

\[ e : S_r^{system} \times (D \times D \setminus Diag_{D \times D}) \longrightarrow \Sigma_{behav}, \qquad (r_{ij}, l, k) \longmapsto e^{lk}_{ij} \]

where $Diag_{D \times D}$ is the set $\{(und, und), (0, 0), (1, 1)\}$. The event $e^{lk}_{ij}$ is hence associated to the residual $r_{ij}$ switching from value l to value k.

Remark The system changes from one mode $q_{i_1}$ to another mode $q_{i_2}$ iff at least one event $e^{01}_{i_1 j}$, $1 \le j \le N_{ARR}(q_{i_1})$, has occurred and all events $e^{und\,0}_{i_2 j}$, $1 \le j \le N_{ARR}(q_{i_2})$, have occurred. In this paper, we deal with the general case in which the order of occurrence of these events is not specified. Additional temporal information [Puig et al., 2005] makes it possible to specify the order of event occurrence.
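The function e can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the string encoding of events, the constant names and the helper `all_events` are hypothetical and do not come from the paper.

```python
# Residual value domain D = {0, 1, und}; "und" marks an ARR that is not
# defined in the current mode.
UND = "und"
D = (0, 1, UND)
# Diag_{DxD}: pairs (l, k) with l == k, for which no switch takes place.
DIAG = {(value, value) for value in D}


def switch_event(residual, l, k):
    """Return a label for the event e^{lk}_{residual} (hypothetical encoding)."""
    if l not in D or k not in D:
        raise ValueError("values must belong to D")
    if (l, k) in DIAG:
        raise ValueError("no event is associated to an unchanged value")
    return f"e[{residual}]:{l}->{k}"


def all_events(residuals):
    """Enumerate the image of e for a collection of residual names."""
    return {
        switch_event(r, l, k)
        for r in residuals
        for l in D
        for k in D
        if (l, k) not in DIAG
    }
```

Each residual contributes six events (the nine pairs of D × D minus the three diagonal pairs), so a mode with two residuals yields twelve candidate events.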


Definition 4 The M-Behavior Automaton for a given system mode $q_i$ (either nominal or faulty) is defined as $M^i_{behav} = (Q^i_{behav}, \Sigma^i_{behav}, T^i_{behav}, q^i_{behav\,0})$, where:

• $Q^i_{behav}$ is the set of M-Behavior Automaton states; each state $q_{i,k} \in Q^i_{behav}$ is characterized by an instance of the global set of residuals $S_r^{system}$, and the trajectories exhibit the different possible orders of the residual switches.

• $\Sigma^i_{behav} \subseteq \Sigma_{behav}$ is the set of events; each event corresponds to one residual value switch.

As stated before, the system’s Behavior Automaton is obtained as an extension of the Mode Automaton by the M-Behavior Automata for each mode. This procedure allows us to generate the system’s Behavior Automaton in a mode-driven way, which avoids enumerating all possible states (see Figure 4). Formally, the system’s Behavior Automaton is the synchronous product of the automata defining all the possible residual value switches, in which all non-accessible states and impossible transitions, as defined by the Mode Automaton, are discarded. The proof of equivalence is not provided in this paper.
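When the order of the residual switches is not specified, the states of an M-Behavior Automaton can be obtained by enumerating every interleaving of the pending switches. The Python sketch below is a simplified illustration, not the construction used in the paper: the function name, the dictionary encoding of a residual instance and the tuple encoding of a switch are assumptions.

```python
from itertools import permutations


def reachable_instances(initial, switches):
    """Enumerate residual instances reached by applying every switch once.

    initial  -- dict mapping a residual name to its value (0, 1 or "und")
    switches -- list of (residual, old_value, new_value) tuples whose order
                of occurrence is left unspecified
    Returns the set of states and the set of (source, switch, target) arcs.
    """
    def freeze(instance):
        # Residual names are unique, so sorting never compares the values.
        return tuple(sorted(instance.items()))

    states = {freeze(initial)}
    transitions = set()
    for order in permutations(switches):
        current = dict(initial)
        for residual, old, new in order:
            if current[residual] != old:
                raise ValueError(f"{residual} is not at value {old}")
            source = freeze(current)
            current[residual] = new
            target = freeze(current)
            states.add(target)
            transitions.add((source, (residual, old, new), target))
    return states, transitions
```

Starting from the instance r1 = 0, r2 = 1, r8 = und, r11 = und with the two pending switches of r8 and r11 from und to 0, the sketch produces four states and four arcs, which is the diamond formed by q11, q12, q13 and q14 in Table 3.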

6 Hybrid Diagnosability Analysis

In this section, the definition of diagnosability is extended to hybrid systems, and sufficient conditions applying separately to the discrete part and to the continuous part of the system are provided. The necessary and sufficient condition is then given in terms of the Behavior Automaton built in section 5.

Definition 5 A hybrid system is said to be diagnosable iff all the anticipated faults are diagnosable. A fault $F_i$ is diagnosable iff its occurrence is always followed by a sequence of observed events, intermingled with continuous variable (or corresponding residual) observations, that makes it possible to distinguish $F_i$ from all $F_j$, $j \neq i$, with certainty.

6.1 Conditions for Hybrid Diagnosability

Proposition 1 The hybrid system S = (X, Q, Σ, T, C, (x0, q0)) is diagnosable if its discrete part M = (Q, Σ, T, q0) (see section 2) is diagnosable according to the DES diagnosability conditions of Theorem 1.

The proof of this proposition is trivial.

In practice, the discrete part of the hybrid system is rarely diagnosable because the Mode Automaton does not include explicit information about the events that occur after the occurrence of a fault. Diagnosability can only be decided on the basis of the observation of command events and is generally not achieved.

Proposition 2 The hybrid system S = (X, Q, Σ, T, C, (x0, q0)) is diagnosable if the underlying continuous systems corresponding to every mode are all diagnosable according to the continuous systems diagnosability conditions [Trave-Massuyes et al., 2004].

The proof of this proposition is trivial.

In practice, the underlying continuous systems are seldom all diagnosable for every mode, and only weak diagnosability is achieved [Struss and Dressler, 2003] [Trave-Massuyes et al., 2004].

Proposition 3 The hybrid system S = (X, Q, Σ, T, C, (x0, q0)) is diagnosable iff its Behavior Automaton is diagnosable according to the DES diagnosability conditions of Theorem 1.²

The Behavior Automaton is a discrete event model enriched with an abstraction of the continuous behaviors that combines the two aspects (continuous and discrete) of the hybrid system. In fact, the original discrete events provide important temporal information about the order in which continuous signatures are expected after the occurrence of a fault. For example, they can be key in discriminating two faults having the same continuous signature but occurring in different modes (after different command events).

7 Motivational Example

Figure 1: Example circuit (labels recoverable from the figure: switch sw, voltages E1 and E2, resistors R1 to R4, currents I1 and I2)

In this section, we take as an example the electrical circuit of Figure 1, which has two nominal operating modes N1 and N2, commanded by a switch sw. Without loss of generality, we only deal with the faulty modes involving the components R1 and R2 (see Figure 2). In other words, we assume that components R3 and R4 are always in nominal mode.

For the sake of simplicity, we adopt the single fault assumption. Two cases will be analyzed: first, the voltages E1, E2 and the currents I1, I2 are observable, which is referred to as Case 1; second, E2 becomes unobservable, which is referred to as Case 2.

7.1 Computation of Analytic Redundancy Relations for Case 1

The ARRs of the system are computed in the two nominal modes and in the fault modes R1-short-circuit, R1-opened-circuit, R2-short-circuit and R2-opened-circuit, as given in Table 2. In each fault mode, the commands "on" and "off" allow one to switch between the two possible configurations of the system (sw=off and sw=on). In the following, for the sake of simplicity, the ARRs are not indexed with respect to modes as in section 5. This notation exhibits the shared ARRs in a clear way.

7.2 Diagnosability analysis for Case 1

In Case 1, the system is diagnosable according to the continuous diagnosability criterion, which proves the diagnosability

² The proof of this proposition is not provided in this paper.


Figure 2: Mode Automaton of the system (transitions to the unknown mode are not all represented, to avoid overprinting). Labels recoverable from the figure: modes N1, N2, R1 short circuit, R1 opened circuit, R2 short circuit, R2 opened circuit and unknown; events on, off, f1, f2, f3, f4 and uf.

System mode | Switch | Corresponding ARRs set
N1 | - | ARR1: E1obs = (R1 + R3) I1obs + R3 I2obs; ARR2: I2obs = (R1/R2) I1obs
N2 | - | ARR2: I2obs = (R1/R2) I1obs; ARR3: (1 + α) E1obs = E2obs + R I1 + R4 I2
R1 opened circuit (f1) | sw=on | ARR4: (1 + α) E1obs = E2obs + R′ I2obs; ARR5: I1obs = 0
R1 opened circuit (f1) | sw=off | ARR5: I1obs = 0; ARR6: E1obs = (R2 + R3) I2obs
R1 short circuit (f2) | sw=on | ARR7: (1 + α) E1obs = E2obs + R4 I1obs; ARR8: I2obs = 0
R1 short circuit (f2) | sw=off | ARR8: I2obs = 0; ARR9: E1obs = R3 I1obs
R2 opened circuit (f3) | sw=on | ARR10: (1 + α) E1obs = E2obs + R I1obs; ARR8: I2obs = 0
R2 opened circuit (f3) | sw=off | ARR8: I2obs = 0; ARR11: E1obs = (R1 + R3) I1obs
R2 short circuit (f4) | sw=on | ARR12: (1 + α) E1obs = E2obs + R4 I2obs; ARR5: I1obs = 0
R2 short circuit (f4) | sw=off | ARR5: I1obs = 0; ARR13: E1obs = R3 I2obs

Table 2: Table of ARRs classified by mode, with R = R1 + R4 + R1R4/R3, R′ = R2 + R4 + R2R4/R3 and α = R4/R3

modes | r1 | r2 | r8 | r11 | r5, r6, r9, r13
q10 | 0 | 0 | und | und | und
q11 | 0 | 1 | und | und | und
q12 | 0 | 1 | 0 | und | und
q13 | 0 | 1 | und | 0 | und
q14 | 0 | 1 | 0 | 0 | und
q20 | und | 0 | und | und | und
q21 | und | 1 | und | und | und
q22 | und | 1 | 0 | und | und

Table 3: Mapping between the modes of the R2-opened-circuit Behavior Automaton (Figure 4) and the instances of the residuals in Case 2.

property of the whole hybrid system by Proposition 2. Indeed, it is easy to see that the set of ARRs is always different from one mode to another, implying that the fault continuous signatures are different.

Figure 3: The R1-short-circuit Behavior Automaton of Case 2. Labels recoverable from the figure: states q30 to q34 (N1 side) and q40 to q42 (N2 side); events on, off, f2 and the residual switches of r2 (0 to 1), r8 (und to 0) and r9 (und to 0).

7.3 Diagnosability analysis for Case 2

In Case 2, the residuals $r_3$, $r_4$, $r_7$, $r_{10}$ and $r_{12}$ are no longer computable, and consequently their corresponding events $e^{lk}_3$, $e^{lk}_4$, $e^{lk}_7$, $e^{lk}_{10}$ and $e^{lk}_{12}$, $(l, k) \in D \times D \setminus Diag_{D \times D}$, must be removed from $\Sigma_{behav}$.

The updated set of ARRs leads to a non-diagnosable underlying continuous system. Since the discrete part is not diagnosable on its own either, none of the sufficient conditions given in Propositions 1 and 2 is fulfilled. Diagnosability analysis must hence be performed through the Behavior Automaton and its corresponding diagnoser. The R1-short-circuit and R2-opened-circuit Behavior Automata are given in Figures 3 and 4, respectively. Table 3 illustrates the mapping between the occurrence of events related to residuals and the resulting residual values in every state of the R2-


Figure 4: The R2-opened-circuit Behavior Automaton of Case 2. Labels recoverable from the figure: states q10 to q14 (N1 side) and q20 to q22 (N2 side); events on, off, f3 and the residual switches of r2 (0 to 1), r8 (und to 0) and r11 (und to 0).

Figure 5: Part of the whole System Diagnoser for Case 2. Labels recoverable from the figure: diagnoser states built from (N1, ∅), (N2, ∅), (q11, f3) to (q14, f3), (q21, f3), (q22, f3), (q31, f2) to (q34, f2), (q41, f2), (q42, f2), (q51, f1), (q61, f1), (q71, f4) and (q81, f4); events on, off and the residual switches of r2, r8, r9 and r11.

opened-circuit Behavior Automaton. The diagnoser has been built and part of it is given in Figure 5.

The diagnoser shows that the hybrid system is not diagnosable according to the hybrid diagnosability condition of Proposition 3 because there is an uncertain "on-off" cycle containing the F2-uncertain and F3-uncertain diagnoser states. However, from the state {(q12, f3), (q32, f2)} of this cycle, there are paths leading to the determined states (q14, f3) and (q34, f2). The state {(q12, f3), (q32, f2)} is reachable by transitioning on an event which corresponds to the command "sw=off", which means that it is controllable. In an ambiguous diagnosis situation of this cycle, an active diagnosis action, namely the "off" command, can hence be applied to reach a diagnosable configuration (in which the fault continuous signatures of the modes R1-short-circuit and R2-opened-circuit become different).

From the point of view of active diagnosis, it is important to notice the difference between an uncertain diagnoser cycle with and without command events. The latter is definitely not diagnosable, but the former can be made diagnosable when reconfiguration is permitted.
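The effect of the "off" command can be illustrated by comparing, for each switch position, the sets of ARRs listed in Table 2. The Python sketch below is a deliberate simplification and not the diagnoser construction of the paper: it treats the ARR set of a mode as its continuous signature. The dictionary layout, the function names and the pairing of ARR sets with switch positions (as read from Table 2) are assumptions made for illustration.

```python
# ARR indices per (mode, switch position), transcribed from Table 2.
# "N" stands for the nominal modes (N1 for sw=off, N2 for sw=on).
ARRS = {
    ("N", "off"): {1, 2},   ("N", "on"): {2, 3},
    ("f1", "off"): {5, 6},  ("f1", "on"): {4, 5},
    ("f2", "off"): {8, 9},  ("f2", "on"): {7, 8},
    ("f3", "off"): {8, 11}, ("f3", "on"): {8, 10},
    ("f4", "off"): {5, 13}, ("f4", "on"): {5, 12},
}
# ARRs that involve E2 and are lost when E2 is unobservable (Case 2).
NEEDS_E2 = {3, 4, 7, 10, 12}


def without(arrs, lost):
    """Remove the ARRs that cannot be computed any more."""
    return {key: indices - lost for key, indices in arrs.items()}


def ambiguous_pairs(arrs, sw):
    """Pairs of modes sharing the same ARR set for a given switch position."""
    modes = sorted(mode for mode, position in arrs if position == sw)
    return [
        (a, b)
        for i, a in enumerate(modes)
        for b in modes[i + 1:]
        if arrs[(a, sw)] == arrs[(b, sw)]
    ]


case2 = without(ARRS, NEEDS_E2)
```

Under these assumptions, Case 1 has no ambiguous pair for either switch position. In Case 2 the pairs (f1, f4) and (f2, f3) share a signature when sw=on and are separated again when sw=off, which is consistent with the use of the "off" command as an active diagnosis action.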

8 Discussion and conclusions

This paper proposes a framework for analyzing the diagnosability of a hybrid system which builds on recent results establishing the formal equivalence of diagnosability definitions for DES and CS. The approach relies on merging the fault signatures exhibited at the continuous level into the Mode Automaton that represents the discrete dynamics of the system, so that DES diagnosability analysis can be performed on the resulting Behavior Automaton and the corresponding diagnoser. When the state of the system is ambiguous, an analysis of the diagnoser allows us to point at reconfiguration actions that safely move the system into a mode reducing ambiguity.

To our knowledge there is no existing work proposing a method to analyze the diagnosability of a hybrid system. The method that we propose interprets the continuous dynamics of the system in terms of events and gives a procedure to merge this knowledge into the discrete dynamics model. Our approach can be related to the work by Lunze, which uses Quantized Automata [Lunze, 2000a] [Lunze, 2000b]. Lunze starts with a continuous system and discretizes the continuous variable value domains. From this discretization, he is able to produce a behavior automaton that accounts for all the variable value switches. The behavior automaton that he produces is hence oriented towards behavior prediction and simulation purposes, and its semantics are quite different from those of the behavior automaton that we produce. In our case, we have pursued the goal of obtaining the same behavior automaton as used by the model-based DES diagnosis community [Sampath et al., 1995] [Puig et al., 2005] [Lamperti and Zanella, 2002], so that their results can then be applied as is. For this purpose, the abstraction of the continuous dynamics is performed from the continuous subspaces that characterize the different modes of the system and the switches undergone by the system state. The subspaces are generated by the ARRs and the switches correspond to value switches for their corresponding residuals. In this framework, fault signatures are uniquely defined, which is not the case when basing the abstraction on a state variable value partitioning of the state space as used by Lunze [Lunze, 2000a].

This paper continues the work done by the French Imalaia group [Cordier et al., 2004] and the Bridge Task Group within the MONET Network of Excellence. It hence uses the knowledge and results obtained by these two groups and establishes yet another bridge between two model-based communities, namely the continuous and the DES model-based communities. Future work will be devoted to the problem of using diagnosability assessment for selecting the best reconfiguration action. This problem goes beyond selecting and applying a discrete action. Indeed, some physical constraints may require planning a sequence of actions, and the hybrid nature of the system may call for hybrid control.

References

[Benazera and Trave-Massuyes, 2003] E. Benazera and L. Trave-Massuyes. The consistency approach to the on-line prediction of hybrid system configurations. IFAC Conference on Analysis and Design of Hybrid Systems ADHS’03, Saint-Malo (France), 2003.

[Benazera et al., 2002] E. Benazera, L. Trave-Massuyes, and P. Dague. State tracking of uncertain hybrid concurrent systems. In 13th International Workshop on Principles of Diagnosis DX’02, pages 106–114, Semmering, Austria, 2002.

[Console et al., 2000] L. Console, C. Picardi, and M. Ribaudo. Diagnosis and diagnosability analysis using PEPA. In Proceedings of 14th European Conference on Artificial Intelligence ECAI’00, pages 131–135, Berlin, Germany, 2000.

[Cordier et al., 2004] M.O. Cordier, P. Dague, F. Levy, J. Montmain, M. Staroswiecki, and L. Trave-Massuyes. Conflicts versus analytical redundancy relations: A comparative analysis of the model-based diagnostic approach from the artificial intelligence and automatic control perspectives. IEEE Transactions on Systems, Man and Cybernetics, Part B, 34(5):2163–2177, 2004.

[Cordier et al., 2006] M.O. Cordier, L. Trave-Massuyes, and Xavier Pucel. Comparing diagnosability criterions in continuous systems and discrete events systems. In Proceedings of the 17th International Workshop on Principles of Diagnosis DX’06, Burgos, Spain, 2006.

[Frisk et al., 2003] E. Frisk, D. Dustegor, M. Krysander, and V. Cocquempot. Improving fault isolability properties by structural analysis of faulty behavior models: application to the DAMADICS benchmark problem. In Proceedings of IFAC Safeprocess’03, Washington, USA, 2003.

[Hofbaur and Williams, 2004] M.W. Hofbaur and B.C. Williams. Hybrid estimation of complex systems. IEEE Transactions on Systems, Man, and Cybernetics - Part B, 34(5):2178–2191, 2004.

[Lamperti and Zanella, 2002] Gianfranco Lamperti and Marina Zanella. Diagnosis of discrete-event systems from uncertain temporal observations. Artif. Intell., 137(1-2):91–163, 2002.

[Lunze, 2000a] Jan Lunze. Diagnosis of quantized systems based on a timed discrete-event model. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 30(3):322–335, 2000.

[Lunze, 2000b] Jan Lunze. Diagnosis of quantized systems by means of timed discrete-event representations. In HSCC, pages 258–271, 2000.

[Pencole, 2004] Y. Pencole. Diagnosability analysis of distributed discrete event systems. In Proceedings of the 16th European Conference on Artificial Intelligence, ECAI’2004, pages 43–47, 2004.

[Puig et al., 2005] V. Puig, J. Quevedo, T. Escobet, and B. Pulido. On the integration of fault detection and isolation in model based fault diagnosis. In Proceedings of the 16th International Workshop on Principles of Diagnosis DX’05, pages 227–232, 2005.

[Roze and Cordier, 2002] L. Roze and M.-O. Cordier. Diagnosing discrete-event systems: extending the “diagnoser approach” to deal with telecommunication networks. Journal on Discrete-Event Dynamic Systems: Theory and Applications (JDEDS), 12(1):43–81, 2002.

[Sampath et al., 1995] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis. Diagnosability of discrete-event systems. IEEE Transactions on Automatic Control, 40:1555–1575, 1995.

[Struss and Dressler, 2003] P. Struss and O. Dressler. A toolbox integrating model-based diagnosability analysis and automated generation of diagnostics. In Proceedings of the 14th International Workshop on Principles of Diagnosis DX’03, Washington DC, US, 2003.

[Trave-Massuyes et al., 2004] L. Trave-Massuyes, T. Escobet, and X. Olive. Diagnosability analysis based on component supported analytical redundancy relations. Rapport LAAS N04080, to appear in IEEE Transactions on Systems, Man and Cybernetics, Part A, 2004.

[Williams and Nayak, 1999] Brian C. Williams and P. Pandurang Nayak. A model-based approach to reactive self-configuring systems. In Jack Minker, editor, Workshop on Logic-Based Artificial Intelligence, Washington, DC, College Park, Maryland, 1999. Computer Science Department, University of Maryland.


Abstract This paper proposes to use informational entropy to analyze chronicle models discovered by a stochastic approach [Le Goc et al, 2005, 2006]. The aim is to define a method and an algorithm to analyze a chronicle model used to predict the occurrence of a particular discrete event class in a diagnosis task of a dynamic system. This is a classification problem over sequences of discrete event occurrences. This problem is rarely tackled in the literature, especially when the resulting model must be interpretable by humans. We propose an algorithm based on an informational entropy criterion that constructs a continuous time decision tree from which a chronicle model is deduced. The latter orders the main discrete event classes that suffice to predict the occurrence of a particular discrete event class. In this paper we show on an example how such an entropic criterion complements the stochastic approach and provides an operational tool to analyze a given chronicle model.

1 Introduction

This paper presents an algorithm to analyze a chronicle model provided by an expert or, as in our case, generated by the BJT algorithm [Le Goc et al, 2005], which is based on a stochastic modeling of continuous time discrete event sequences. The chronicle models generated by the BJT algorithm are used in a diagnosis task in order to predict the occurrence of a particular discrete event class. A chronicle model is a set of binary relations between discrete event classes that are time constrained. The aim of this work is to define a method and a tool to analyze the contribution of the discrete event classes of such chronicle models to the prediction of the occurrence of a particular discrete event class. For that purpose, we propose to use an informational entropic criterion to build a continuous time decision tree that describes the way the occurrences of a particular discrete event class are generated. Our approach is inspired by the TemporalID3 algorithm of [Console and Picardi, 2003].

Our algorithm, called CTID3 for ID3 with Continuous Time values, uses a chronicle model to build a set of sequences that are labeled OK or KO according to whether or not they lead to an occurrence of a given target discrete event class. The algorithm works in three steps: (1) the representation of the sequences in the form of a continuous time decision table, (2) the construction of a continuous time decision tree, and (3) the deduction of a chronicle model from the temporal decision tree. This chronicle model specifies the discrete event classes that contribute to the prediction of the occurrence of the target class. This then provides a means to analyze the initial chronicle model. In this paper, the method is assessed with a set of theoretical sequences corresponding to the prediction of a certain event class B.

The next section discusses research in the temporal knowledge discovery domain and the related works. Section 3 presents the definition of a decision tree and describes the well-known ID3 algorithm [Quinlan, 1986]; we then introduce the extension to timed data proposed in [Console and Picardi, 2003]. Section 4 describes the problem statement. Our algorithm, CTID3, is presented in section 5 with a theoretical example. The conclusion presents our future work.

2 Related Works

Classification is one of the most typical tasks in supervised learning, but it has not received much attention in the temporal domain [Antunes and Oliveira, 2001]. In fact, there are relatively few applications based on temporal classification in the literature, notably when the resulting model must be interpretable by humans. The main contributions are decision trees [Breiman and al, 1984; Murthy, 1998], which are widely used for the classification of temporal data [Kadous, 1999], [Drucker and Hubner, 2002], [Rodriguez and Alonso, 2004]. Their popularity comes from their capacity to produce interpretable, comprehensible classification models. Timed data is still a basic problem in knowledge discovery for model-based diagnosis applications. The use of temporal knowledge is required in a large range of domains, from engineering to medicine or marketing for example. In the engineering domain, the diagnosis of a dynamic system such as a telecommunication network or an industrial production system aims at predicting the malfunctions from the flow of alarms generated by the equipment of the supervision system. These alarms must be filtered in order to

Towards an Entropic Approach for the Analysis of Chronicle Models

Nabil Benayadi, Marc Le Goc and Philippe Bouché. Laboratoire des Sciences de l'Information et des Systèmes - LSIS UMR CNRS 6168 - Université Paul Cézanne Aix-Marseille III

Avenue Escadrille Normandie Niemen, 13397 Marseille Cedex 20 – France nabil.benayadi, marc.legoc, [email protected]


preserve only those which are the most interesting. [Mannila and al, 1997] proposes a set of methods allowing the extraction of frequent patterns (called episodes) from a sequence of alarms. Episodes are said to be parallel when the alarms are not ordered and sequential when they are completely ordered. These methods are inspired by those developed in the marketing domain. [Agrawal and Srikant, 1994] defined an approach allowing the extraction of sequential patterns over a large database of customer transactions in order to identify the groups of the most sold articles, where each transaction consists of a customer-id, a transaction time and the items bought in the transaction. They proposed three algorithms (AprioriAll, AprioriSome and DynamicSome) to solve this problem, all being extensions to temporal data of the well-known Apriori algorithm [Agrawal and al, 1993]. This approach was applied to the analysis of the alarms generated by a telecommunication network [Hatonen and al, 1996a, 1996b]. The only temporal constraint is a maximum global interval (observation window) covering the duration of the episodes. This interval must be defined by the user, whereas the temporal constraints between the elements of the episodes are completely ignored.

[Cordier and Dousson, 2000] and [Dousson and Duong, 1999] propose an approach to discover chronicle models from a log of timed alarms, which aims at exhibiting recurring phenomena. This work completes that of Mannila with the introduction of temporal constraints between alarms. However, the sequentiality of the alarms is not directly expressed in this approach, and the temporal constraints are calculated with an ad hoc heuristic. Mannila and al [2002] summarized the main results obtained using Apriori-like algorithms and observed that these algorithms identify very local relations between events. To avoid this problem, Mannila suggests using other algorithms that adopt a more global point of view, and proposes to investigate Markov theory.

[Le Goc et al, 2005, 2006] propose a stochastic approach that aims at building chronicle models from a sequence of discrete event classes generated by a knowledge-based system. This approach is based on the representation of a sequence of discrete event classes in the dual forms of a Markov chain and a superposition of Poisson processes. It provides global chronicle models with operational timed constraints that can be used in a prediction task for diagnosis. These chronicle models can contain events which have a maximum probability in the Markov chain but bear no relation to the event to predict. This approach therefore requires tools to analyze the contribution of the occurrences of the discrete events that lead to an occurrence of the class to predict. An informational entropic criterion can be used to this aim.

Such a criterion is used to build decision trees [Quinlan, 1986]. [Geurts and Wehenkel, 1998] proposes an extension of binary decision trees to timed data, called temporal decision trees, for the early prediction of electric power system blackouts, using a large database of random power system scenarios generated by Monte-Carlo simulation. Each scenario is described by temporal variables and sequences of events describing the dynamics of the system as observed from real time measurements. The aim is to derive models that are as simple as possible to detect a blackout problem in the system. More recently, [Console and Picardi, 2003] used temporal decision trees to diagnose the behavior of dynamic systems in order to decide which corrective actions to carry out. Inspired by [Geurts and Wehenkel, 1998], they extend the ID3 algorithm [Quinlan, 1986] to temporal data with an algorithm called TemporalID3, in order to produce compact n-ary temporal decision trees from a set of problematic situations. We propose to adapt this algorithm with the aim of using an informational entropic criterion to analyze a chronicle model.

3 Temporal Decision Trees

Decision trees are used to implement classification problem solving methods that can be used in diagnostic tasks. Each node of the tree corresponds to a variable. A node can have as many descendants as the number of values taken by the variable. The leaves correspond to the different decisions. Formally, a decision tree is a structure T = <r, N, E, L> where:

• $N = N_I \cup N_L$ is the union of a set $N_I = \{x_i\}$ of internal nodes indicating a variable $x_i$ and a set $N_L = \{a_i\}$ of leaf nodes indicating a decision $a_i$,

• $r \in N$ is the root node of the tree,

• $E \subseteq N_I \times N$ is a set of arcs (an arc corresponding to a value $v_j$ for a variable $x_i$),

• L is a labeling function defined over $N \cup E$ which returns:

• the name of the variable $x_i$ associated to a node of $N_I$, or

• the decision $a_i$ associated to a leaf node of $N_L$, or

• the value $v_j$ associated to an arc of E.

The ID3 algorithm uses the informational entropy in a set Σ of cases to build a minimal decision tree. A case s is a collection of values $v_j$ taken by a set of variables x leading to a particular decision $a_i$. At each node, ID3 chooses the variable x which minimizes the entropy ξ(x, Σ) in the set of cases Σ:

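\[
\begin{aligned}
\xi(x, \Sigma) &= \sum_{j=1}^{k} P(x = v_j) \times \xi(\Sigma_{|x=v_j}) \\
\xi(\Sigma_{|x=v_j}) &= -\sum_{i=1}^{n} P(a_i; \Sigma_{|x=v_j}) \times \log_2\big(P(a_i; \Sigma_{|x=v_j})\big) \\
\Sigma_{|x=v_j} &= \{\, s \in \Sigma \mid \text{the variable } x \text{ has value } v_j \text{ in } s \,\}
\end{aligned}
\tag{1}
\]

where $\xi(\Sigma_{|x=v_j})$ is the entropy of the partition $\Sigma_{|x=v_j}$ and $P(a_i; \Sigma_{|x=v_j})$ is the probability of the decision $a_i$ in this partition $\Sigma_{|x=v_j}$.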
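The criterion of equation (1) can be sketched in a few lines of Python. This is a minimal illustration, assuming that a case is encoded as a pair (dictionary of variable values, decision) and that probabilities are estimated by relative frequencies; the function names are hypothetical.

```python
from collections import Counter
from math import log2


def partition_entropy(cases):
    """Entropy of the decisions within one partition of cases."""
    counts = Counter(decision for _, decision in cases)
    total = len(cases)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def variable_entropy(cases, x):
    """Weighted entropy xi(x, Sigma) of the partitions induced by variable x."""
    by_value = {}
    for values, decision in cases:
        by_value.setdefault(values[x], []).append((values, decision))
    total = len(cases)
    return sum(
        len(part) / total * partition_entropy(part)
        for part in by_value.values()
    )


def best_variable(cases, variables):
    """Variable chosen at a node: the one minimizing the entropy."""
    return min(variables, key=lambda x: variable_entropy(cases, x))
```

For four cases in which the value of x2 determines the decision while x1 does not, `variable_entropy` returns 0 for x2 and 1 for x1, so x2 would be selected for the node.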

Table 1: Example of Temporal Decision Table

Situation | x1 (t0 t1 t2 t3) | x2 (t0 t1 t2 t3) | x3 (t0 t1 t2 t3) | Decision | Dl
s1 | n v n n | h h n n | l v v v | a1 | t3
s2 | h h v - | h n n - | h n n - | a2 | t2


According to [Console and Picardi, 2003], a temporal decision tree is a decision tree where a node is a couple $(x_i, t_k)$, $x_i$ indicating a variable and $t_k$ the observation date of its value, and an arc defines a value $v_j$ of $x_i$ at the date $t_k$ (i.e. $x_i(t_k) = v_j$). It is a structure T = <r, N, E, L, Tp> endowed with a time-labeling function $Tp : N_I \to \mathbb{R}^+$ which gives the date associated to an internal node. The training set is a collection of situations $S = \{s_e\}_{e=0,\dots,m}$. A situation $s_e$ is the set of the values $v_j$ taken by a set of variables $X = \{x_i\}$ at every observation date $t_k$ leading to a particular decision $a_n$ (Table 1). A situation $s_e$ refers to a discrete time clock where $t_k \equiv kT$, $k \in \mathbb{N}$ and $T \in \mathbb{R}^+$; T is the discretization period. The variables x1, x2, x3 of Table 1 take the qualitative values n, h, l, or v at observation dates t0, t1, t2, t3. In the first situation s1 (resp. s2), the decision a1 (resp. a2) must be taken at the latest at the date t3 (resp. t2). This date is called the "limit" because the knowledge of the values of the variables beyond this date is useless for the decision-making.

TemporalID3 is an extension of ID3 to timed data according to a discrete time clock structure, designed to build temporal decision trees [Console and Picardi, 2003]. A partition $P_e$ is a subset of S containing situations that are identical on a time interval: $\forall t_k \in [t_{min}, t_{max}]$, $\forall x \in X$, $\forall s_i, s_j \in P_e$, $Ttd[s_i, x, t_k] = Ttd[s_j, x, t_k] \Rightarrow s_i \equiv s_j$. Thus, at every observation date t, S is partitioned into a set of partitions $P_{\approx,t} = \{P_e\}_{e=0,\dots,m}$:

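\[
\begin{aligned}
& P_{\approx,t} = \{P_e\}_{e=0,1,\dots,m},\quad P_e \subseteq S,\quad \forall t_k \in [t, Dl(P_e)],\ \forall s_i, s_j \in P_e,\ \forall x \in X, \\
& Ttd[s_i, x, t_k] = Ttd[s_j, x, t_k] \ \wedge\ Dl(P_e) = \min_{s_l \in P_e} Dl(s_l)
\end{aligned}
\tag{2}
\]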

TemporalID3 builds a tree by seeking a time interval which maximizes a criterion related to the number of partitions. Then, like ID3, TemporalID3 chooses the couple $(x_i, t)$ that minimizes the entropy criterion (3) in this interval and creates the corresponding node.

\[ \xi((x_i, t), S) = \sum_{j=1}^{k} P(x_i(t) = v_j) \times \xi(S_{|x_i(t)=v_j}) \tag{3} \]

All the values of the variables at every date before t are eliminated from the table, including $x_i(t)$. TemporalID3 then repeats this treatment until one of two termination conditions is met: all the situations of S are classified, or there are no valid observations x(t) left for splitting S.
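The selection of a couple $(x_i, t)$ by criterion (3) can be sketched as follows. This is a simplified illustration and not the TemporalID3 algorithm of [Console and Picardi, 2003]: the search for the time interval is replaced by the assumption that candidate dates run from 0 to the smallest deadline of the situations, and the encoding of a situation as a dictionary with the keys `obs`, `decision` and `deadline` is hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(decisions):
    """Entropy of a list of decisions, estimated by relative frequencies."""
    total = len(decisions)
    return sum(-(c / total) * log2(c / total)
               for c in Counter(decisions).values())


def pair_entropy(situations, x, t):
    """Criterion (3): weighted entropy after splitting on the value x(t)."""
    groups = defaultdict(list)
    for s in situations:
        groups[s["obs"][(x, t)]].append(s["decision"])
    total = len(situations)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def best_pair(situations, variables):
    """Couple (x, t) minimizing criterion (3) up to the smallest deadline."""
    # Assumption: only dates not exceeding every deadline are candidates.
    horizon = min(s["deadline"] for s in situations)
    candidates = [(x, t) for t in range(horizon + 1) for x in variables]
    return min(candidates, key=lambda pair: pair_entropy(situations, *pair))
```

Because `min` keeps the first minimal candidate, ties are resolved in favor of the earliest date and then of the order in which the variables are listed.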

4 Problem statement

In this section, we give the definition of the problem of analyzing chronicle models. First, we give a brief overview of the spatial discretization principle and the notion of chronicle model based on the formal description of event classes introduced in [Le Goc, 2004] and extended in [Le Goc and al, 2005]. Second, we examine the problem of the chronicle models generated by the stochastic approach.

A sequence $\omega_n = \{o_k\}_{k=0,\dots,m-1}$, $\omega_n \in \Omega$, is an ordered set of m occurrences $o_k \equiv (t_k, x, r_m)$ of discrete events $e_k \equiv (x, r_m)$, where $x \in X$ is the name of a real variable, $r_m \in R_x = \{r_k\}_{k=0,\dots,n}$ is an interval index of values for x(t), and $t_k \in \Gamma = \{t_i\}$, $t_i \in \mathbb{R}$, is the time of assignment of the index $r_m$ to the variable x, $x(t_k) = r_m$. The occurrences are timed with a continuous clock structure (i.e. $t_{k-2} - t_{k-1} \neq t_{k-1} - t_k$):

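\[ \forall t_k \in \mathbb{R},\ \forall r_m \in R_x,\ \exists t_{k-1} < t_k,\quad x(t_{k-1}) \notin r_m \wedge x(t_k) \in r_m \ \Rightarrow\ o_k \equiv (t_k, x, r_m) \tag{4} \]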
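Equation (4) can be illustrated on a sampled signal. The sketch below is an approximation under stated assumptions: the intervals $r_m$ are taken to be delimited by a sorted list of thresholds, the signal is only known at the sampling dates (so the date of an occurrence is the first sample found in the new interval), and all function and parameter names are hypothetical.

```python
from bisect import bisect_right


def interval_index(value, thresholds):
    """Index m of the interval r_m that contains value (thresholds sorted)."""
    return bisect_right(thresholds, value)


def occurrences(name, samples, thresholds):
    """Build occurrences (t_k, x, r_m) each time the signal enters r_m.

    samples is a list of (date, value) pairs ordered by date. The first
    sample yields no occurrence because equation (4) requires an earlier
    date at which the value was outside r_m.
    """
    result = []
    previous = None
    for date, value in samples:
        index = interval_index(value, thresholds)
        if previous is not None and index != previous:
            result.append((date, name, index))
        previous = index
    return result
```

With the thresholds 2.0 and 3.0, a signal that rises from 1.2 to 3.5 and later falls to 0.5 produces two occurrences, one for entering interval 2 and one for entering interval 0.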

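Let $Evt = \{e_k\}_{k \in \mathbb{N}}$ be a set of events defined over $X \times R$ and $\Gamma = \{t_i\}_{t_i \in \mathbb{R}}$ a set of occurrence times defined over $\mathbb{R}$. We note $E_o = \{o_k\}_{k \in \mathbb{N}}$, $o_k \equiv (t_k, x, r_m)$, a set of occurrences of discrete events defined over $\Gamma \times Evt$. Let d be a function that provides the date of an occurrence:

\[ d : E_o \to \Gamma,\quad \forall o_k \equiv (t_k, x, r_m) \in E_o,\ d(o_k) = t_k \tag{5} \]

A sequence $\omega_i = \{o_k\}_{k=0,\dots,m-1}$ defines a subset $\Gamma_{\omega_i} = \{t_j\}$ of dates, so $\Omega = \{\omega_i\}_{i=0,\dots,n-1}$ defines the set $\Gamma = \bigcup_{i=0,\dots,n-1} \Gamma_{\omega_i}$:

\[ \forall \omega_i \in \Omega,\ o_k \in \omega_i \ \Rightarrow\ d(o_k) \in \Gamma_{\omega_i} \subseteq \Gamma \tag{6} \]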

A couple $(o_k, o_{k+1})$ of two successive occurrences related to the same variable x describes the temporal evolution of the discrete function x(t), defined on $\mathbb{N}$:

\[ \forall o_k \equiv (t_k, x, r_m),\ o_{k+1} \equiv (t_{k+1}, x, r_n),\ (o_k, o_{k+1}) \in \omega \ \Rightarrow\ \forall t \in [t_k, t_{k+1}[,\ x(t) = r_m \wedge x(t_{k+1}) = r_n \tag{7} \]

A discrete event class is a set $C^j = \{e_i\}$ of discrete events $e_i \equiv (x, r_m)$. We will use the notation "$e_i :: C^j$" to denote that the discrete event $e_i$ belongs to the class $C^j$. By extension, we will note "$o_k :: C^j$" an occurrence of a discrete event that belongs to the class $C^j$. A binary relation $R(C^i, C^o, [\tau^-, \tau^+])$ describes an oriented relation between two discrete event classes that is time constrained. "$[\tau^-, \tau^+]$" is the time window for observing an occurrence of the output class $C^o$ after the occurrence of the input class $C^i$.

\[ R(C^i, C^o, [\tau^-, \tau^+]) \ \Leftrightarrow\ \exists\, o_n, o_k \in \omega \subseteq \Omega,\ (o_n :: C^i) \wedge (o_k :: C^o) \wedge \big(d(o_k) - d(o_n) \in [\tau^-, \tau^+]\big) \tag{8} \]

A chronicle model is a set of binary relations with timed constraints between classes of discrete events. For example, a chronicle model M3 = {R12(C1, C2, [τ12-, τ12+]), R23(C2, C3, [τ23-, τ23+])} defines two binary relations between three discrete event classes and means that there exist at least three occurrences in Ω such that:

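$$\exists\, o_k, o_n, o_m \in \Omega,\quad (o_k :: C_1) \wedge (o_n :: C_2) \wedge (o_m :: C_3) \wedge (d(o_n) - d(o_k) \in [\tau_{12}^-, \tau_{12}^+]) \wedge (d(o_m) - d(o_n) \in [\tau_{23}^-, \tau_{23}^+]) \qquad (9)$$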
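As an illustration of the condition in equation (9), the following Python sketch searches a timed sequence for chains of occurrences that satisfy a linear chronicle model. It is only a sketch under stated assumptions: the function name `find_instances`, the class names and the time windows are hypothetical, and the model is assumed to be a simple chain in which the output class of each relation is the input class of the next one. It is not the BJT algorithm.

```python
from typing import List, Tuple

Occurrence = Tuple[float, str]             # (date d(o), class name)
Relation = Tuple[str, str, float, float]   # (C_i, C_o, tau_minus, tau_plus)


def find_instances(seq: List[Occurrence],
                   model: List[Relation]) -> List[List[Occurrence]]:
    """Return every chain of occurrences satisfying a linear chronicle model."""
    # One partial chain per occurrence of the input class of the first relation.
    chains = [[o] for o in seq if o[1] == model[0][0]]
    for _c_in, c_out, tau_minus, tau_plus in model:
        extended = []
        for chain in chains:
            last_date = chain[-1][0]
            for o in seq:
                # Timed constraint: d(o_out) - d(o_in) must lie in [tau-, tau+].
                if o[1] == c_out and tau_minus <= o[0] - last_date <= tau_plus:
                    extended.append(chain + [o])
        chains = extended
    return chains


# Hypothetical data: three classes and two relations, as in the model M3.
sequence = [(0.0, "C1"), (1.0, "C2"), (1.5, "C3"), (5.0, "C1"), (9.0, "C2")]
m3 = [("C1", "C2", 0.0, 2.0), ("C2", "C3", 0.0, 1.0)]
instances = find_instances(sequence, m3)
```

Only the first C1 occurrence starts a complete chain here, because the second one is followed by a C2 occurrence outside the first time window.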

Such chronicle models can be used to predict the occurrence of the final event classes, like C3 in the M3 chronicle model. To this aim, rules like the following can be used in a diagnosis task:

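$$\forall \omega \subseteq \Omega,\ \forall o_k, o_n \in \omega,\quad (o_k :: C_1) \wedge (o_n :: C_2) \wedge (d(o_n) - d(o_k) \in [\tau_{12}^-, \tau_{12}^+]) \ \Rightarrow\ \exists\, o_m \in \omega,\ (o_m :: C_3) \wedge (d(o_m) - d(o_n) \in [\tau_{23}^-, \tau_{23}^+]) \qquad (10)$$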

[Le Goc and al, 2005] proposes an algorithm called BJT to discover such chronicle models from a sequence of discrete event occurrences. When the occurrences of the discrete event classes are independent and uniformly distributed according to the exponential distribution, a sequence can be modeled under the dual form of a homogeneous continuous time Markov chain and its corresponding superposition of Poisson processes. The BJT algorithm uses these two representations to build chronicle models. This constitutes the basis of the stochastic approach (see [Le Goc and al, 2006] for a global discussion of the stochastic approach).


The anticipation ratio is a criterion for measuring the quality of the chronicle models produced by the stochastic approach. It relates two counts. The first is the number of sub-sequences that respect the sequential and temporal constraints of a chronicle model, called "sequences Ok" (the correctly predicted occurrences of the target class) and noted ωn^Ok. The second is the number of sub-sequences that respect the constraints of the model deprived of its last binary relation, the one concerning the class to be predicted, called "sequences Ko" (the falsely predicted occurrences of the target class) and noted ωn^Ko:

$$\text{anticipation ratio} = \frac{nbr(\omega_n^{Ok})}{nbr(\omega_n^{Ok}) + nbr(\omega_n^{Ko})} \qquad (11)$$

For example, let us consider a sequence ω0 of occurrences of the classes C = {A, B, C, D} describing the evolution of four variables X = {v_a, v_b, v_c, v_d} in a given time period:

ω0 = {(0.8774, B), (1.9313, A), (2.8625, C), (3.8718, A), (4.4063, B), (4.7837, D), (6.0282, B), (6.0874, C), (6.2531, A), (8.0034, D), (8.4572, A), (9.2311, A), (9.4742, C), (9.5447, B), (9.8285, A), (10.6631, B), (11.3967, A), (12.9826, B), (13.4464, A), (13.621, B), (13.7333, C), (14.1756, D), (14.6806, A), (15.9598, B), (16.0240, A), (16.7736, C), (18.2447, D), (18.3639, A), (18.9228, B), (19.0749, A), (19.3406, C), (21.1271, B), (21.3377, A), (22.6778, A), (23.0197, D), (23.8478, C), (24.0392, B), (24.3164, A), (25.7974, A), (26.1294, C), (26.5961, A), (26.882, B), (28.3387, B), (28.9188, D), (29.1697, A), (29.7840, B), (31.1968, C), (31.2985, A), (32.1786, A), (33.3798, B), (33.798, A), (33.8452, C), (34.6423, B), (35.1186, A), (36.0294, A), (37.0752, A), (37.4236, B), (37.4543, D), (38.31, B), (39.0201, C), (39.1335, A), (40.2009, B), (41.1502, A), (41.992, C), (42.5301, A), (43.0774, B), (43.4958, A), (43.8625, B), (46.3524, C), (49.6673, C)}.

Let us also consider a chronicle model deduced from this sequence with a stochastic approach (Figure 1).

A --[0, 2.14]--> C --[0, 1.642]--> A --[0, 1.51]--> B

Figure 1: Chronicle Model for B class

This chronicle model defines a training set Ω = {ωn} containing 8 sequences Ok (ωn^Ok) and 16 sequences Ko (ωn^Ko) (Figure 2).

Figure 2: Sequences Ok (top) and Ko (bottom).
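With the counts given above (8 sequences Ok and 16 sequences Ko), equation (11) can be evaluated directly. The short Python sketch below does this. The function name `anticipation_ratio` is an assumption introduced for illustration.

```python
def anticipation_ratio(n_ok: int, n_ko: int) -> float:
    """Equation (11): share of sequences Ok among sequences Ok and Ko."""
    if n_ok + n_ko == 0:
        raise ValueError("the training set contains no sequence")
    return n_ok / (n_ok + n_ko)


# Counts of the training set defined by the chronicle model of Figure 1.
ratio = anticipation_ratio(8, 16)   # 8 / 24, i.e. one third
```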

Ideally, a chronicle model would have a ratio of 100% (nbr(ωn^Ko) = 0). In practice, because occurrences are generated according to a Poisson distribution, this ratio cannot be 100%. [Le Goc and al, 2005] considers that a chronicle model with an anticipation ratio greater than 50% is operational for diagnosis tasks. Our aim is then to define the means to analyze, and eventually to improve, the anticipation ratio of an operational chronicle model.

The idea consists in using a temporal supervised classification method to identify the discrete event classes of a given chronicle model that allow us to decide whether a sequence is Ok or not. The method must provide a result that is: (1) temporal, to take into account the temporal nature of the occurrences, (2) distinctive, to characterize only the sequences Ok, (3) interpretable, to be able to explain the reasons for the decision. The Temporal Decision Trees of [Console and Picardi, 2003], based on [Geurts and Wehenkel, 1998], have these properties but must be adapted to data that are timed according to a continuous clock structure.

5 Continuous Time ID3

This section introduces the CTID3 algorithm (Continuous Time ID3) to classify a set Ω = ΩOk ∪ ΩKo, where ΩOk = {ωn^Ok} contains a set of sequences Ok and ΩKo = {ωn^Ko} contains a set of sequences Ko. CTID3 works in three stages: (1) the representation of the sequences in a temporal decision table, (2) the construction of a temporal decision tree, (3) the extraction of the chronicle model from the temporal decision tree. This section details these stages.

5.1 Sequences representation

The objective of this stage is to build a decision table similar to Table 1 containing a set of training sequences Ω = ΩOk ∪ ΩKo. This consists mainly in passing from data timed with a continuous time clock structure to data timed with a discrete time clock structure. The first step re-dates the occurrences of a sequence in a time relative to the date of its last occurrence. Let tmax be the date of the last occurrence in a sequence ωi defining a set Γωi of dates:

$$t_{max} = \max_{k=0,\dots,n} t_k, \quad t_k \in \Gamma_{\omega_i}$$

The new time of an occurrence in a sequence ωi is given by:

$$\forall o_k \in \omega_i,\quad d(o_k) := t_{max} - d(o_k)$$

Now the sequences of Ω can be analyzed from the end to the beginning, in the opposite direction of time, so that the limit date Dl(ωn) of a sequence ωn is the largest date in the re-dated sequence: Dl(ωn) = max{tk | ∃ o ∈ ωn, d(o) = tk}. By extension, the global limit date Dl(Ω) of a set of sequences Ω = {ωn} is the smallest limit date: Dl(Ω) = min{Dl(ωn) | ωn ∈ Ω}.
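The re-dating step and the two limit dates can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are assumptions, and the two input sequences are short sub-sequences taken from ω0 for the purpose of the example.

```python
from typing import List, Tuple

Occurrence = Tuple[float, str]   # (date, class name)


def redate(seq: List[Occurrence]) -> List[Occurrence]:
    """Re-date each occurrence relative to the last one: d(o) := t_max - d(o)."""
    t_max = max(t for t, _ in seq)
    return [(t_max - t, cls) for t, cls in seq]


def limit_date(redated: List[Occurrence]) -> float:
    """Dl(w): the largest date of a re-dated sequence."""
    return max(t for t, _ in redated)


def global_limit_date(redated_seqs: List[List[Occurrence]]) -> float:
    """Dl(Omega): the smallest limit date over a set of re-dated sequences."""
    return min(limit_date(seq) for seq in redated_seqs)


# Two sub-sequences of w0 (chosen for illustration).
s1 = redate([(1.9313, "A"), (2.8625, "C"), (3.8718, "A"), (4.4063, "B")])
s2 = redate([(22.6778, "A"), (23.0197, "D"), (23.8478, "C"),
             (24.0392, "B"), (24.3164, "A")])
dl_1 = limit_date(s1)                  # about 2.475
dl_all = global_limit_date([s1, s2])   # about 1.6386
```

After re-dating, the last occurrence of each sequence has date 0 and the first one carries the limit date.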

Table 2: Continuous Time temporal Decision Table

Ω   | v_a (0.0  0.1  …  4.1  4.5) | v_b (0.0  0.1  …  4.1  4.5) | v_c (0.0  0.1  …  4.1  4.5) | v_d (0.0  0.1  …  4.1  4.5) | Dec | Di
ω1  | C?  C?  …  A   C?           | B   C?  …  C?  C?           | C?  C?  …  C?  C?           | C?  C?  …  D   D            | Ok  | 2.47
ω2  | C?  A   …  A   A            | B   B   …  C?  B            | C?  C   …  C?  C            | C?  C?  …  C?  D            | Ko  | 1.64
…   | …                           | …                           | …                           | …                           | …   | …
ω24 | C?  C?  …  A   C?           | C?  B   …  C?  C?           | C?  C   …  C?  C?           | D   D   …  C?  C?           | Ko  | 2.71

The set of sequences Ω = {ωn} defines a set X = ∪Xωi, i=0,…,n, of variables xi and a set Γ = ∪Γωi, i=0,…,n, of observation times tk. In the example of the section 2.2 (Figure 2), X contains 4 variables and Γ 174 observation times. The construction of a temporal decision table like Table 1 requires knowing the value of each variable xi of X at each observation time tk of Γ. But because equation (4) does not allow these values to be deduced from a given sequence ωi, the decision table must be completed with event occurrences of the form ok ≡ (t, x, ?), or ok::C?, where «C?» indicates the class of unknown values (Table 2).
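The completion of the decision table with the class C? can be sketched as follows. The sketch makes several assumptions that are not stated in the text: each class is attached to exactly one variable (A to v_a, B to v_b, and so on, as the names suggest), the function name `decision_row` is hypothetical, and the two re-dated sequences are rounded values computed by hand from sub-sequences of ω0.

```python
from typing import Dict, List, Tuple

UNKNOWN = "C?"   # class of unknown values


def decision_row(redated: List[Tuple[float, str]],
                 variable_of: Dict[str, str],
                 variables: List[str],
                 times: List[float]) -> Dict[Tuple[str, float], str]:
    """Build one table row: a class for every (variable, observation time)."""
    # Every cell is unknown unless the sequence has an occurrence there.
    row = {(x, t): UNKNOWN for x in variables for t in times}
    for t, cls in redated:
        row[(variable_of[cls], t)] = cls
    return row


variable_of = {"A": "v_a", "B": "v_b", "C": "v_c", "D": "v_d"}  # assumed mapping
variables = sorted(set(variable_of.values()))
seq1 = [(2.475, "A"), (1.5438, "C"), (0.5345, "A"), (0.0, "B")]
seq2 = [(1.6386, "A"), (1.2967, "D"), (0.4686, "C"), (0.2772, "B"), (0.0, "A")]
# Gamma: union of the observation times of all sequences.
times = sorted({t for seq in (seq1, seq2) for t, _ in seq})
row1 = decision_row(seq1, variable_of, variables, times)
```

Most cells of a row are C?, since a sequence only fixes the value of one variable at each of its own observation times.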

[Figure 2, extracted labels: (1.93, A), (2.86, C), (3.87, A), (4.40, B) and (22.67, A), (23.01, D), (23.84, C), (24.03, B), (24.31, A)]


5.2 Continuous Time Temporal Decision Tree

The continuous time temporal decision tree associated with the set of discrete event sequences Ω = {ωn} can then be built by applying the TemporalID3 algorithm to the continuous time temporal decision table (Table 2). Note that the TemporalID3 algorithm has a number of prerequisites, and two of these are relevant in our case: the compatibility criterion, which guarantees that the sequences are recognized before their "limit date", and the minimal entropy criterion, which guarantees the minimization of the size of the resulting temporal decision tree. The proofs of the TemporalID3 algorithm are provided in [Console and Picardi, 2003].
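To give an idea of the quantity behind an entropy criterion, the Python sketch below computes the standard Shannon entropy of the Ok/Ko decisions of the training set (8 sequences Ok, 16 sequences Ko). This is the usual ID3 measure [Quinlan, 1986]. How TemporalID3 combines it with the compatibility criterion is described in [Console and Picardi, 2003] and is not reproduced here. The function name `entropy` is an assumption.

```python
import math
from typing import Sequence


def entropy(counts: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a distribution given by class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Decisions of the training set: 8 sequences Ok and 16 sequences Ko.
h_training = entropy([8, 16])   # about 0.918 bits
```

An attribute test is informative when the subsets it produces have a lower weighted entropy than the set it splits.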

[Figure 3 tree: nodes ⟨v_a, 0.092626⟩, ⟨v_c, 1.08452⟩, ⟨v_c, 0.68979⟩, ⟨v_a, 1.471358⟩, ⟨v_c, 1.747701⟩, ⟨v_c, 1.784185⟩; arcs labeled A, B, C or C?; leaves Ok or Ko]

Figure 3 : Continuous Time Temporal Decision Tree

Applied to Table 2, TemporalID3 builds the continuous time temporal decision tree of Figure 3. This tree provides the different decisions that lead to an occurrence of the discrete event class B according to Ω. A decision corresponds to a node and an output arc that specify a triplet of the form (xi, Cj, tk), where xi is the name of a variable, Cj denotes a discrete event class, and tk is the latest time at which an instance of Cj must occur.

5.3 Chronicle Model Extraction

A series of such decisions can be used to analyze the corresponding chronicle model. To this aim, the idea is to transform the tree into a set of chronicle models. To deduce the chronicle model from a continuous time temporal decision tree, we first prune the branches leading to a decision Ko. Next, we prune the branches with at least one arc labeled with the C? class. This provides a tree whose branches can be interpreted as a set of binary relations between discrete event classes. The time constraints are then estimated from the limit times of two successive nodes and the average inter-occurrence delay of the corresponding discrete event classes. Because a continuous time temporal decision tree is built from a decision table in which time is reversed, the construction of the model goes from the leaves to the root according to the following algorithm:

• For each arc (n1n2)∈E of the tree, where n2 is the child of n1, we create a node in the chronicle model whose name is the event class L(n1n2).

• For each pair of nodes (n_c1, n_c2) of the chronicle model created respectively from two successive arcs (n2n3) and (n1n2), we create the arc (n_c1, n_c2), whose value is a temporal interval [τ1, τ2] calculated in the following way (where λ provides the average inter-occurrence delay of a discrete event class):

• τ1 = (d(n2)-d(n1))-λ(n_c2).

• τ2 = (d(n2)-d(n1))+ λ(n_c1).
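The two formulas for τ1 and τ2 can be written as a small Python function. In this sketch the node dates are those of the branch ⟨v_a, 0.092626⟩ → ⟨v_c, 0.68979⟩ of Figure 3, but the two λ values are hypothetical placeholders, since the average inter-occurrence delays are not given in the text. The function name `time_constraint` is an assumption, and the sketch does not decide what to do when τ1 is negative.

```python
from typing import Tuple


def time_constraint(d_n1: float, d_n2: float,
                    lam_c1: float, lam_c2: float) -> Tuple[float, float]:
    """Interval [tau1, tau2] of the arc (n_c1, n_c2) of the chronicle model.

    d_n1, d_n2: dates of the tree nodes n1 and n2 (n2 is the child of n1).
    lam_c1, lam_c2: average inter-occurrence delays of the classes n_c1, n_c2.
    """
    delta = d_n2 - d_n1
    tau1 = delta - lam_c2
    tau2 = delta + lam_c1
    return tau1, tau2


# Node dates from Figure 3; the lambda values are hypothetical.
tau1, tau2 = time_constraint(d_n1=0.092626, d_n2=0.68979,
                             lam_c1=0.2, lam_c2=0.1)
```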

Applied to the bold branch ((v_a, 0.092626)→(v_c, 0.68979)→(Ok)) of the tree of Figure 3, this algorithm produces the chronicle model of Figure 4.

Figure 4: New Chronicle Model

It can be seen that the relation R(A, C, [0, 2.14]) of the model of Figure 1, produced by a stochastic approach, no longer appears in the new chronicle model produced by the entropic approach. This suggests that this relation brings only little information for predicting an occurrence of the target class B. The temporal constraints are stronger because Ω is made up only of sequences respecting the timed constraints of the stochastic model.

6 Conclusion

This article proposes the CTID3 algorithm to compute chronicle models from a set of discrete event sequences. The algorithm transforms a set of sequences of discrete event classes into a continuous time temporal decision table, to which the TemporalID3 algorithm of [Console and Picardi, 2003] can be applied to construct a continuous time temporal decision tree. Such a tree describes the series of decisions that leads to the occurrence of a particular discrete event class. Because a decision in such a tree corresponds to an occurrence of a discrete event class, CTID3 can build a set of chronicle models describing the relations between the discrete event classes that contribute to the occurrences of a given discrete event class. When the set of discrete event classes is provided by means of a given chronicle model, the chronicle models produced by the CTID3 algorithm can be used to analyze the given chronicle model.

The main advantage of this approach is that the entropic criterion identifies the event classes that are the most significant for predicting the occurrence of a particular event class. This property is particularly important when the chronicle models are provided by learning algorithms like the one used in the stochastic approach of [Le Goc and al, 2005]. This first result invites us to combine the entropic and stochastic approaches to discover temporal knowledge with a stronger predictive power. The CTID3 algorithm is implemented and will be applied to the discovery of the roads models of the St-Microelectronics company.

[Figure 4: chronicle model C → A → B, each arc labeled with a time interval]


References

[Agrawal and al, 1993] Agrawal Rakesh, Imielinski Tomasz, and Swami Arun. Mining Association Rules between Sets of Items in Large Databases. Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 207-216, May 1993.

[Agrawal and Srikant, 1994] Agrawal Rakesh and Srikant Ramakrishnan. Mining Sequential Patterns. Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995. Expanded version available as IBM Research Report RJ9910, October 1994.

[Antunes and Oliveira, 2001] Antunes Claudia and Oliveira Arlindo. Temporal Data Mining: an Overview. In KDD Workshop on Temporal Data Mining, pages 1–13, San Francisco, 2001.

[Breiman and al, 1984] Breiman Leo, Friedman Jerome H., Olshen R. A., and Stone C. J. Classification and Regression Trees. Belmont, CA: Chapman & Hall, 1984.

[Console and Picardi, 2003] Console Luca, Picardi Claudia, and Theseider Dupré Daniele. Temporal Decision Trees: Model-based Diagnosis of Dynamic Systems On-Board. Journal of Artificial Intelligence Research, 19, pp. 469-512, 2003.

[Cordier and Dousson, 2000] Cordier M.O. and Dousson C. Alarm Driven Monitoring Based on Chronicle. Proceedings of SafeProcess 2000, pages 286-291, Budapest, Hungary, 2000.

[Dousson and Duong, 1999] Dousson Christophe and Vu Duong Thang. Discovering Chronicles with Numerical Time Constraints from Alarm Logs for Monitoring Dynamic Systems. 16th International Joint Conference on Artificial Intelligence (IJCAI'99), pp. 620-626, 1999.

[Drucker and al, 2001] Drucker Christian, Hubner Sebastian, Visser Ubbo, and Weland H. Georg. "As Time Goes by": Using Time Series Based Decision Tree Induction to Analyze the Behaviour of Opponent Players. RoboCup 2001: Robot Soccer World Cup V, LNAI 2377, pp. 325–330. Berlin: Springer-Verlag, 2001.

[Frydman and al, 2001] Frydman Claudia, Le Goc Marc, Torres Lucile, and Giambiasi Norbert. Knowledge-Based Diagnosis in SACHEM using DEVS Models. Special Issues of Transaction of Society for Modeling and Simulation International (SCS) on Recent Advances in DEVS Methodology, Tag Gon Kim Ed., Volume 18, n°3, pages 147-158, 2001.

[Geurts and Wehenkel, 1998] Geurts Pierre and Wehenkel Louis. Early prediction of electric power system blackouts by temporal machine learning. In Proc. of ICML-AAAI'98 Workshop on "AI Approaches to Time-series Analysis", Madison (Wisconsin), 1998.

[Hatonen and al, 1996a] Hatonen Kimmo, Klemettinen Mika, Mannila Heikki, Ronkainen Pirjo, and Toivonen Hannu. Knowledge discovery from telecommunication network alarm databases. In 12th International Conference on Data Engineering (ICDE '96), New Orleans, LA, pp. 115–122, 1996.

[Hatonen and al, 1996b] Hatonen Kimmo, Klemettinen Mika, Mannila Heikki, Ronkainen Pirjo, and Toivonen Hannu. TASA: Telecommunication alarm sequence analyzer, or how to enjoy faults in your network. In 1996 IEEE Network Operations and Management Symposium (NOMS '96), Kyoto, Japan, pp. 520–529, 1996.

[Kadous, 1999] Kadous M. Waleed. Learning comprehensible descriptions of multivariate time series. Proceedings of the Sixteenth International Conference on Machine Learning, pp. 454–463. San Francisco: Morgan Kaufmann, 1999.

[Le Goc, 2004] Le Goc Marc. Sachem, a Real Time Intelligent Diagnosis System based on the Discrete Event Paradigm. Simulation, The Society for Modeling and Simulation International Ed., Volume 80, pages 591-617, 2004.

[Le Goc and al, 2005] Le Goc Marc, Bouché Philippe, and Giambiasi Norbert. Stochastic Modeling of Continuous Time Discrete Event Sequence for Diagnosis. In 16th International Workshop on Principles of Diagnosis, DX-05, Pacific Grove, California, USA, June 1-3, 2005.

[Le Goc and al, 2006] Le Goc Marc, Bouché Philippe, and Giambiasi Norbert. Temporal Abstraction of Timed Alarm Sequences for Diagnosis. In COGIS'06, International Conference on Cognitive Systems with Interactive Sensors, Paris, France, March 15-17, 2006.

[Mannila and al, 1995] Mannila Heikki, Toivonen Hannu, and Verkamo A. Inkeri. Discovering Frequent Episodes in Sequences. First International Conference on Knowledge Discovery and Data Mining (KDD'95), pp. 210-215, 1995.

[Mannila and al, 1997] Mannila Heikki, Toivonen Hannu, and Verkamo A. Inkeri. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259–289, 1997.

[Mannila, 2002] Mannila Heikki. Local and Global Methods in Data Mining: Basic Techniques and Open Problems. 29th International Colloquium on Automata, Languages and Programming, Volume n°2380, pages 57-68, Malaga, Spain, 2002.

[Murthy, 1998] Murthy K. Sreerama. Automatic Construction of Decision Trees from Data: A Multi-disciplinary Survey. Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 345-389, 1998.

[Quinlan, 1986] Quinlan J. Ross. Induction of decision trees. Machine Learning 1, 81-106, 1986.

[Rodriguez and Alonso, 2004] Rodriguez Juan J. and Alonso Carlos J. Interval and Dynamic Time Warping-based Decision Trees. Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 548-552, 2004.


Distributed Diagnosis by using a Condensed Local Representation of the Global Diagnoses with Minimal Cardinality

Jonas Biteus∗
Dept. of Electrical Engineering, Linköpings universitet, Sweden

Erik Frisk
Dept. of Electrical Engineering, Linköpings universitet, Sweden

Mattias Nyberg
Power-train Division, Scania AB, Sweden

Abstract

The set of diagnoses is commonly calculated in consistency based diagnosis, where a diagnosis includes a set of faulty components. In some applications, the search for diagnoses is reduced to the set of diagnoses with minimal cardinality. In distributed systems, local diagnoses are calculated in each agent, and global diagnoses are calculated for the complete system. The key contribution in the present paper is an algorithm that synchronizes the local diagnoses in each agent such that these represent the global diagnoses with minimal cardinality. The resulting diagnoses only include faulty components used by the specific agent, and are therefore a condensed local representation of the global diagnoses with minimal cardinality.

1 Introduction

This paper considers distributed systems that consist of a set of agents, where an agent is a more or less independent software entity, connected to each other via some network [Hayes, 1999; Weiss, 1999]. In distributed systems, the diagnoses can be divided into two types: global diagnoses, which are diagnoses for the complete distributed system, and local diagnoses, which are diagnoses for a single agent [Roos et al., 2002].

It is an advantage to have the set of minimal global diagnoses in each agent. However, an agent is only interested in the fault status of the components used by that agent, since the other components do not affect it. Consider for example a global diagnosis that consists of a set of components that have been found to be faulty. An agent that uses some components in its operation is interested in knowing whether any of these components are included in the global diagnosis. The agent is, however, not interested in the fault status of the rest of the components in the global diagnosis. In some applications, the calculation of diagnoses is focused on some smaller set of diagnoses, for example the most probable diagnoses [de Kleer, 1991] or the diagnoses with minimal cardinality [de Kleer, 1990].

∗Corresponding author. E-mail: [email protected]. Address: Vehicular Systems, Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden. Phone: +46 13 281994.

The key contribution in the present paper is an algorithm that synchronizes the minimal local diagnoses in one agent with the minimal local diagnoses in the other agents, such that the result is a set of diagnoses with minimal cardinality. Each resulting diagnosis is a subset of some global diagnoses, and only components used by the specific agent are included in the resulting diagnosis. Since only the components used by the specific agent are included, both the size and the number of the resulting diagnoses with minimal cardinality are reduced compared to the set of global diagnoses with minimal cardinality. The resulting diagnoses with minimal cardinality are therefore a condensed local representation of the global diagnoses with minimal cardinality, and are here denoted condensed diagnoses with minimal cardinality. By reducing the size and the number, the algorithm requires a low computational load, low memory usage, and low network load. The algorithm is distributed such that it can handle both changes in the number of agents and the exchange of single agents. The algorithm is described in Section 5, using the framework for distributed diagnosis presented in Sections 3–4.

Our work has been inspired by diagnosis in distributed embedded systems used in automotive vehicles. These systems typically consist of precomputed diagnostic tests that are evaluated in different agents, which in the automotive industry correspond to electronic control units (ECUs). Sets of conflicts are generated when the diagnostic tests are evaluated in the ECUs, and the ECUs then compute sets of minimal local diagnoses. These embedded distributed systems typically consist of ECUs with both limited processing power and limited RAM memory. The algorithm presented here is therefore constructed such that it requires low processing power and low memory usage. In these systems, it should be possible to exchange, add, or remove ECUs without having to do any changes to the diagnostic software. The algorithm presented in this paper is therefore constructed such that it can handle such changes. Requirements on diagnostic systems used in automotive vehicles are discussed in Section 2.

1.1 Related Work

Most research, such as [Reiter, 1987], has been aimed at the centralized diagnosis problem. These methods can also be used for distributed systems by letting a central diagnostic agent collect conflicts from the system and then calculate the minimal global diagnoses. It is not always suitable to use a dedicated central diagnostic agent, due to for example limited computing resources in each agent, robustness against agent disconnection, and the possibility to add new agents to the network. There therefore exist algorithms, see for example [Provan, 2002], which compute the minimal global diagnoses in a cooperation between the agents. These algorithms aim at the complete set of global diagnoses, while the method presented here aims at the set of global diagnoses with minimal cardinality.

There also exist algorithms where the agents update the sets of local diagnoses such that these are consistent with the global diagnoses [Roos et al., 2003]. The method does not guarantee that a combination of the agents' local minimal diagnoses is also a global minimal diagnosis. However, for every global minimal diagnosis, there is a combination of local minimal diagnoses. The updated sets of local diagnoses represent the global diagnoses without actually computing the complete set of global diagnoses. The method presented in this paper updates the local diagnoses such that these represent the global diagnoses with minimal cardinality.

In [Biteus et al., 2005], a method was presented that calculated the global diagnoses with minimal cardinality by transmitting the minimal local diagnoses from one agent to another, which adds its minimal local diagnoses and then transmits the result to the next agent, etc. Even though the method is efficient, it might not give a sufficiently distributed diagnostic system, since it requires a lot of cooperation between the agents. In [Biteus et al., 2005], the computational burden when calculating the global diagnoses with minimal cardinality was reduced by partitioning the system into two or more sub-systems, whose minimal local diagnoses did not share components with each other. The partition approach has similarities with the tree reduction technique used in [Wotawa, 2001] and can also be used for the algorithm presented here.

Related to this work is also the EU funded project Multi-Agents-based Diagnostic Data Acquisition and Management in Complex Systems (MAGIC), which develops an architecture useful for distributed diagnosis [Koppen-Seliger et al., 2003]. The project discusses protocols for network communication, control algorithms, and other aspects of the integration of diagnosis in distributed systems.

2 Requirements on Diagnostic Systems used in the Automotive Industry

To better understand the industrial demands on diagnosis, the distributed diagnostic systems used in Scania AB heavy-duty trucks have been analyzed. These systems consist of ECUs connected to each other via a controller area network (CAN). The software embedded in the ECUs is primarily used for control and monitoring.

2.1 An Example of a Distributed System

One configuration of the distributed system in Scania's heavy-duty vehicles is shown in Figure 1. The system includes three separate CAN buses, the red, the yellow, and the green. Each of the ECUs is connected to sensors and actuators, and both sensor values and control signals can be shared with the other ECUs over the network. One example of an ECU is the engine management system, which is connected to sensors and actuators related to the engine. There can be up to about 30 ECUs in the system, depending on the type of the truck, and roughly between 4 and 110 components are diagnosed by each ECU.

[Figure 1 labels: red bus, yellow bus, green bus, diagnostic bus; trailer connections (15-pole, 7-pole, ISO 11992/2, ISO 11992/3); ECUs: COO Coordinator system, EMS Engine management system, GMS Gear box management system, BMS Brake management system, ACS Articulation control system, ACC Automatic climate control, CSS Crash safety system, WTA Auxiliary heater system water-to-air, ATA Auxiliary heater system air-to-air, RTG Road transport informatics gateway, CTS Clock and timer system, AUS Audio system, ICL Instrument cluster system, AWD All wheel drive system, TCO Tachograph system, LAS Locking and alarm system, VIS Visibility system, EEC Exhaust Emission Control, SMD Suspension management dolly]

Figure 1: The distributed system in current Scania heavy-duty vehicles.

The ECUs' CPUs typically have a clock speed of 8 to 64 MHz and a RAM memory capacity of about 4 to 150 kB. CAN buses can typically transfer 100 to 500 kbit/s. As these numbers indicate, there is not much computational, memory, or network capacity available, especially when considering that the ECUs should be used for both control and diagnosis.

2.2 Overall Requirements on Distributed Systems

A distributed system that can present the same information to users as if it were a centralized system can be denoted transparent [Tanenbaum and van Steen, 2002]. Considering fault diagnosis, one interpretation of transparency is that the minimal diagnoses presented by the distributed diagnostic system should be the same as those presented by a centralized diagnostic system, meaning that the minimal global diagnoses should be presented, not only the minimal local diagnoses. Another interpretation of transparency is that, even though one ECU fails to deliver its minimal local diagnoses, the remaining system should still be able to deliver the minimal global diagnoses. This means that the diagnostic processes should be distributed among the ECUs, or if a centralized diagnostic ECU is used, backup ECUs should exist.

If it is possible to increase or decrease the size of the system without changes in the software, the system can be said to be scalable [Tanenbaum and van Steen, 2002]. Considering a truck, it should be possible to attach new parts, including new ECUs, to the network without having to change the software in the ECUs.

If it is possible to exchange an ECU for, for example, a new version without having to change the software in the other ECUs, the system can be said to be interoperable [Tanenbaum and van Steen, 2002]. This is especially important in automotive systems, where parts are frequently replaced by parts from other manufacturers.

2.3 Requirement Conclusions

An algorithm for distributed diagnosis used in automotive vehicles should require limited processing power, memory usage, and network load. The algorithm should further result in a transparent, scalable, and interoperable system.

The algorithm presented in this paper synchronizes the minimal local diagnoses, which results in a transparent, scalable, and interoperable system. The condensed diagnoses with minimal cardinality do not include components that are not of interest, and this reduces computational load, memory usage, and network load.

3 Consistency Based Diagnosis

A system consists of a set of components C, where a component is something that can be diagnosed. This not only includes components directly connected to the agents, such as sensors and actuators, but also includes components shared between the agents, e.g. cables and pipes.

To reduce the complexity of the diagnostic system, it is sometimes preferable to only consider the abnormal AB and the not abnormal ¬AB mode, where the AB mode does not have a model. This means that the minimal diagnosis hypothesis is fulfilled [de Kleer et al., 1992], and therefore the notation in for example GDE will be employed.

A diagnosis is a set of components D ⊆ C, such that the components' abnormal behaviors, the remaining components' normal behaviors, the system description, and the observations are consistent. Since the minimal diagnosis hypothesis is fulfilled and D is a diagnosis, all supersets of D are also diagnoses. Further, a diagnosis D′ is a minimal diagnosis if there is no proper subset D ⊂ D′ where D is a diagnosis [de Kleer et al., 1992].

An evaluation of a diagnostic test results in a conflict if some components, checked by the test, have been found to behave abnormally. A conflict is a set of components π ⊆ C, such that the components' normal behaviors, the system description, and the observations are inconsistent. A set D ⊆ C is a diagnosis if and only if it has a nonempty intersection with every conflict in a set of conflicts. A consequence of this is that the set of minimal diagnoses is exactly determined by the set of minimal conflicts [de Kleer et al., 1992].
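The relation between conflicts and diagnoses can be made concrete with a brute-force Python sketch: a diagnosis is a set of components that intersects every conflict, and a minimal diagnosis has no proper subset that is also a diagnosis. The function name `minimal_diagnoses` and the example components are assumptions, and the exhaustive enumeration is only meant for very small examples. It is not the algorithm of this paper.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set


def minimal_diagnoses(components: Iterable[str],
                      conflicts: List[Set[str]]) -> List[FrozenSet[str]]:
    """Enumerate the minimal sets that intersect every conflict."""
    ordered = sorted(components)
    found: List[FrozenSet[str]] = []
    # Candidates are tried by increasing size, so subsets are found first.
    for size in range(len(ordered) + 1):
        for candidate in combinations(ordered, size):
            d = frozenset(candidate)
            hits_all = all(d & conflict for conflict in conflicts)
            is_superset = any(m <= d for m in found)
            if hits_all and not is_superset:
                found.append(d)
    return found


# Hypothetical example: two conflicts over three components.
diagnoses = minimal_diagnoses({"A", "B", "C"}, [{"A", "B"}, {"B", "C"}])
```

In this example the minimal diagnoses are {B} and {A, C}: B alone intersects both conflicts, and A and C together do as well.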

In some cases, it is computationally intractable to calculate the complete set of minimal diagnoses. To reduce the computational cost, the search can be focused on the diagnoses with minimal cardinality, as described in for example [de Kleer, 1990]. Let 𝔻 be a set of diagnoses, then the set of minimal cardinality diagnoses is the set 𝔻mc = {D ∈ 𝔻 : |D| = min_{D′∈𝔻} |D′|}.
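The definition of the minimal cardinality diagnoses translates directly into a filter over a set of diagnoses. The following Python sketch shows this. The function name `minimal_cardinality` and the example diagnoses are assumptions.

```python
from typing import FrozenSet, List


def minimal_cardinality(diagnoses: List[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Keep the diagnoses whose size equals the smallest size in the set."""
    if not diagnoses:
        return []
    smallest = min(len(d) for d in diagnoses)
    return [d for d in diagnoses if len(d) == smallest]


# Hypothetical set of diagnoses with sizes 1, 2 and 1.
all_diagnoses = [frozenset({"B"}), frozenset({"A", "C"}), frozenset({"D"})]
mc = minimal_cardinality(all_diagnoses)
```

Note that a minimal diagnosis such as {A, C} is dropped here: it is minimal with respect to set inclusion, but not with respect to cardinality.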

4 Distributed Diagnosis

This section presents the framework for distributed systems that will be used to describe how condensed diagnoses with minimal cardinality can be calculated.

[Figure 2 labels: Agent A1, Agent A2, components A, B, C, D, input, output, network signal, diagnosis]

Figure 2: Agents, network, components, and diagnosis.

4.1 Outputs, Inputs, and Components in Distributed Systems

A distributed system consists of a set of agents A. A local diagnosis is determined by the conflicts in a single agent, while a global diagnosis is determined by all agents' conflicts.

Here, the complete set of components is partitioned into private components $\mathcal{P} \subseteq \mathcal{C}$ and common components $\mathcal{G} \subseteq \mathcal{C}$, where $\mathcal{P} \cap \mathcal{G} = \emptyset$. A private component is used by only one agent, while a common component is used by two or more agents. The set of private components is partitioned into sets belonging to different agents such that for any two such sets, $\mathcal{P}^{A_i} \cap \mathcal{P}^{A_j} = \emptyset$, where $A_i, A_j \in \mathcal{A}$ and $i \neq j$. The set $X^A$ is the subset of $X$ that is used in agent $A$.

In addition to components, an agent in a distributed system should also be able to diagnose inputs from other agents. The outputs are values from sensors, values to actuators, or results of calculations, which are made available to the other agents over the network. The complete set of signals $\mathcal{S}$, a set of inputs $IN \subseteq \mathcal{S}$, and a set of outputs $OUT \subseteq \mathcal{S}$ are used here. Each output $\sigma \in OUT$ is connected to a subset of inputs $\Gamma \subseteq IN$.

Example 1: Figure 2 shows a typical layout of agents and components. The system consists of two agents, a network, and four sensor components, $A$ to $D$. The sensors $A$ and $B$ are physically connected to agent $A_1$, while the sensors $C$ and $D$ are connected to $A_2$. A diagnosis in agent $A_1$ could for example include the components $A$, $B$, and $C$, connected with dashed lines. Agent $A_1$ diagnoses sensor $C$ indirectly, through the signal transmitted over the network. ⋄

An output might depend on other components, such as sensors, and the information about this relationship is stored in the output’s assumptions.

Definition 1 (Assumption) Let $s$ be a signal which is an output from agent $A$ or an input that is connected to an output from agent $A$. Let the set $ass(s) \subseteq \mathcal{P}^A \cup \mathcal{G} \cup IN^A$. If $s$ is abnormal if and only if some non-empty subset $C \subseteq ass(s)$ is abnormal, then $ass(s)$ is the set of assumptions for $s$.

Each output depends on components and other inputs. This dependency can be propagated to a set consisting only of components.

Definition 2 (Dependency) Let $s \in \mathcal{S}$ be a signal; then the dependency for $s$ is

$$dep(s) = \bigl(ass(s) \cap \mathcal{C}\bigr) \cup \bigcup_{t \in ass(s) \cap \mathcal{S}} dep(t).$$


Figure 3: An example of a two agent system.

Since the $dep(\cdot)$ function is defined implicitly, in terms of itself, the possibility of loops has to be considered in an implementation.

Example 2: Continuation of Example 1. The assumption of the output is $ass(\text{output}) = \{C\}$. The dependency equals the assumption since the assumption does not include any signal.

⋄
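One possible way to evaluate the dependency of Definition 2 while guarding against loops is sketched below. It is a minimal sketch under stated assumptions: the function name `dep`, the dictionary `ass` mapping each signal to its assumption set, and the rule that every assumption element not listed in `components` is treated as a signal are all choices made for this illustration.

```python
def dep(signal, ass, components):
    """Collect the components that `signal` ultimately depends on.

    `ass` maps a signal name to its set of assumptions. Each signal is
    expanded at most once, so loops between signals terminate.
    """
    seen = {signal}
    stack = [signal]
    result = set()
    while stack:
        current = stack.pop()
        for item in ass.get(current, ()):
            if item in components:
                result.add(item)      # a component: part of the dependency
            elif item not in seen:
                seen.add(item)        # a signal not expanded yet
                stack.append(item)
    return result


# Example 2: the output depends directly on sensor C.
example_2 = dep("output", {"output": {"C"}}, {"A", "B", "C", "D"})

# A hypothetical loop: s depends on t, and t depends back on s.
looped = dep("s", {"s": {"t", "A"}, "t": {"s", "B"}}, {"A", "B", "C"})
```

The explicit `seen` set is what keeps the traversal finite; without it, the two mutually dependent signals in the second call would be expanded forever.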

4.2 Diagnoses & Conflicts on Components & Inputs

An agent should state diagnoses that include both components and inputs. The diagnoses in Section 3 cannot be used directly since these only include components. Instead, a component-input diagnosis is defined on the set $\mathcal{C} \cup IN$.

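Definition 3 (Component-input diagnosis) A set $D = C \cup \Gamma \subseteq \Theta$, with $C \subseteq \mathcal{C}$ and $\Gamma \subseteq IN$, is a component-input diagnosis if the set $C \cup \bar{C}$, where $\forall i \in \Gamma : \bar{C} \cap dep(i) \neq \emptyset$, is a diagnosis.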

A component-input diagnosis will simply be called a diagnosis when there is no risk of misunderstanding. As with diagnoses, conflicts can also be defined on $\mathcal{C} \cup IN$.

4.3 Condensed Diagnoses Representing Global Diagnoses

It was mentioned in the introduction that the condensed diagnoses with minimal cardinality should be calculated in each agent.

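Definition 4 (Condensed diagnosis) Let $\mathbb{D}$ be a set of global diagnoses. The tuple $\langle D, k \rangle$, where $D \subseteq \mathcal{P}^A \cup \mathcal{G} \cup IN^A$ and $k \in \mathbb{Z}$, is a condensed diagnosis in agent $A$ if $\exists \bar{D} \in \mathbb{D}$ such that

$$|D| + k = |\bar{D}|, \quad D \cap \mathcal{P} = \bar{D} \cap \mathcal{P}^A, \quad D \cap \mathcal{G} = \bar{D} \cap \mathcal{G},$$

$$D \cap IN = \{ i : dep(i) \cap \bar{D} \setminus D \neq \emptyset,\ i \in IN \}.$$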

The condensed diagnosis $\langle D, k \rangle$ in agent $A$ is a tuple where the set $D$ represents the part of some global diagnosis, $\bar{D}$ in the definition, that consists of components used by agent $A$. The variable $k$ accounts for the components that are included in $\bar{D}$ but not in $D$.

Interpretation of the different requirements for a condensed diagnosis: $|D| + k = |\bar{D}|$ means that the cardinality of $D$ plus $k$ should equal that of the global diagnosis with minimal cardinality; $D \cap \mathcal{P} = \bar{D} \cap \mathcal{P}^A$ means that $D$ should only include private components used by agent $A$; $D \cap \mathcal{G} = \bar{D} \cap \mathcal{G}$ means that the common components should be included; $D \cap IN = \{ i : dep(i) \cap \bar{D} \setminus D \neq \emptyset,\ i \in IN \}$ means that inputs that might be faulty, due to their dependency on some faulty components, should be included in $D$.

Example 3: Consider the system shown in Figure 3. There exists a signal $s$ whose dependency is $dep(s) = \{B\}$, represented by the dotted line. The sets of private components are $\mathcal{P}^{A_1} = \{A\}$ and $\mathcal{P}^{A_2} = \{B, C\}$. The set of common components is $\mathcal{G} = \{G\}$. Let $\{A, B, C, G\}$ be a global diagnosis.

A condensed diagnosis in agent $A_1$ is $\langle \{A, G, s\}, 1 \rangle$. Component $A$ is included since it is a private component in $A_1$, $G$ since it is a common component, and $s$ since it depends on the faulty component $B$. Component $C$ is represented by $k = 1$ since it does not affect agent $A_1$. ⋄
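The construction in Example 3 can be traced with a short sketch that derives a condensed diagnosis from one known global diagnosis by following the requirements of Definition 4. The function name `condense_global`, its parameters, and the use of strings for components and signals are assumptions made for this illustration; it is not the distributed algorithm of Section 5, which never has the global diagnosis available.

```python
def condense_global(global_diagnosis, private_a, common, inputs_a, dep):
    """Build the condensed diagnosis (D, k) of one global diagnosis for one agent.

    private_a: private components of the agent
    common:    common components
    inputs_a:  inputs of the agent
    dep:       mapping from each input to the components it depends on
    """
    d_bar = set(global_diagnosis)
    # Private components of this agent and common components are kept as they are.
    kept = d_bar & (set(private_a) | set(common))
    # Components of the global diagnosis that the agent does not see directly.
    hidden = d_bar - kept
    # An input is included if it depends on one of the hidden faulty components.
    faulty_inputs = {i for i in inputs_a if set(dep[i]) & hidden}
    d = kept | faulty_inputs
    # k makes up the difference in cardinality: |D| + k = |D_bar|.
    return d, len(d_bar) - len(d)


# Example 3, seen from agent A1.
d, k = condense_global({"A", "B", "C", "G"}, {"A"}, {"G"}, {"s"}, {"s": {"B"}})
```

For the data of Example 3 this yields the set containing $A$, $G$, and $s$ together with $k = 1$, matching the condensed diagnosis stated above.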

The minimal cardinality condensed diagnoses is the set of condensed diagnoses where each condensed diagnosis is a subset of some minimal cardinality global diagnosis, i.e. in Definition 4, the set of global diagnoses $\mathbb{D}$ is the set of minimal cardinality global diagnoses. The objective of the algorithm described in the next section is to calculate the sets of minimal cardinality condensed diagnoses in each agent.

5 Algorithm for Calculating the Minimal Cardinality Condensed Diagnoses

The main idea of the algorithm presented in this section is that each agent first finds and then transmits the subset of minimal local diagnoses that other agents might be interested in. Each agent then receives the transmitted diagnoses and merges these with its own set of minimal local diagnoses, resulting in the minimal cardinality condensed diagnoses.

The transmitting part is described in Section 5.2, the receiving and merging is described in Section 5.3, and finally the main algorithm is described in Section 5.4.

In the algorithms, $D$ is some diagnosis, $\Gamma \subseteq IN$, $\Omega \subseteq OUT$, $P \subseteq \mathcal{P}$, and $G \subseteq \mathcal{G}$.

5.1 Outputs Dependent on Inputs

The algorithm, as written, requires that $ass(s) \subseteq \mathcal{C}$, i.e. an output’s value does not depend on any signal. This can be fulfilled for a general system, where $ass(s) \subseteq \mathcal{C} \cup IN$, by replacing the assumptions with the corresponding dependencies, such that $ass(s) := dep(s) \subseteq \mathcal{C}$.

5.2 Transmit Diagnoses

The first step is to find and transmit the subset of minimal local diagnoses that is of interest to the other agents.

A minimal local diagnosis should be transmitted if it includes common components, inputs, or components that outputs depend on, since these might affect other agents. For each diagnosis that should be transmitted, the private components can be removed since these do not directly affect other agents. If some outputs depend on any of the removed private components, these outputs are instead added to the diagnosis. The minimal local diagnoses that are not transmitted on the network can be represented by a variable $n \in \mathbb{N}$, which is the minimal cardinality of the non-transmitted local diagnoses. The agents receiving the diagnoses will then be aware that there exist one or more non-transmitted local diagnoses with cardinality $n$.

Algorithm 1 Transmit diagnoses transmit($A$, $m$).
Require: A set of minimal local diagnoses $\mathbb{D}^A$, limit $m$.
Ensure: Set $TX$ broadcast on the network.
1: $\mathbb{D}^{TX} := \{D \in \mathbb{D}^A : |D| \leq m\}$
2: $\mathbb{D}^{TX} := \{D \in \mathbb{D}^{TX} : D \cap (IN \cup \mathcal{G} \cup (\cup_{\sigma \in OUT^A} ass(\sigma))) \neq \emptyset\}$
3: $n := \min_{D \in \mathbb{D}^A \setminus \mathbb{D}^{TX}} |D|$
4: $TX := \{\langle D', k \rangle : D \in \mathbb{D}^{TX},\ D = P \cup G \cup \Gamma,\ D' = G \cup \Gamma \cup \Omega,\ \Omega = \{\sigma \in OUT^A : ass(\sigma) \cap P \neq \emptyset\},\ k = |P| - |\Omega|\}$
5: if $|\mathbb{D}^{TX}| \neq |\mathbb{D}^A|$ then
6: $TX := TX \cup \{\langle \emptyset, n \rangle\}$
7: end if
8: Broadcast $TX$ on the network.

Algorithm 1 performs the steps described above. Since the minimal cardinality condensed diagnoses are searched for, the algorithm accepts a maximum limit $m$ on the cardinality of the local diagnoses to be transmitted. Row 2 decides which diagnoses should be transmitted. Row 4 constructs a tuple consisting of the diagnosis without private components and a variable $k \in \mathbb{Z}$ representing the removed private components.

Example 4: Consider the system shown in Figure 4, where objects connected to the agents with solid lines are included in some minimal local diagnosis, while those connected with dashed lines are not included. The sets of private components are $\mathcal{P}^{A_1} = \{A\}$, $\mathcal{P}^{A_2} = \{B, C, D\}$, and $\mathcal{P}^{A_3} = \{E\}$. The set of common components is $\mathcal{G} = \{G\}$. There exist two signals with assumptions $ass(s_1) = \{B\}$ and $ass(s_2) = \{E\}$.

Assume that the following set of minimal local diagnoses has been calculated in agent $A_2$:

$$\mathbb{D}^{A_2} = \{\{C, s_2\}, \{B, C\}, \{G, C\}, \{D\}\}.$$

Using Algorithm 1 with $m \geq 2$, the set

$$\mathbb{D}^{TX} = \{\{C, s_2\}, \{B, C\}, \{G, C\}\}$$

is first calculated. The variable $n = 1$, i.e. the cardinality of $\{D\}$. The transmitted set of tuples is

$$TX = \{\langle \{s_2\}, 1 \rangle, \langle \{s_1\}, 1 \rangle, \langle \{G\}, 1 \rangle, \langle \emptyset, 1 \rangle\}$$

where the private components have been removed. This set will represent $\mathbb{D}^{A_2}$ in agents $A_1$ and $A_3$. ⋄
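The rows of Algorithm 1 can be followed in a compact sketch. It is a reading of the algorithm under stated assumptions rather than a reference implementation: the function name `transmit` follows the paper, but its parameter list, the representation of diagnoses as sets of strings, and the dictionary `outputs_ass` (mapping each output of the agent to its assumptions) are choices made for this illustration, and the broadcast in row 8 is replaced by returning the set.

```python
def transmit(local_diagnoses, m, inputs, common, outputs_ass):
    """Sketch of Algorithm 1 for one agent; returns the set TX of (set, k) tuples."""
    local = [frozenset(d) for d in local_diagnoses]
    # Row 1: only local diagnoses with cardinality at most m are considered.
    within_limit = [d for d in local if m >= len(d)]
    # Row 2: keep diagnoses touching inputs, common components, or output assumptions.
    shared = set(inputs) | set(common)
    for assumption in outputs_ass.values():
        shared |= set(assumption)
    d_tx = [d for d in within_limit if d & shared]
    # Row 4: drop private components, add outputs that depend on a dropped one.
    tx = set()
    for d in d_tx:
        gamma = d & set(inputs)
        g = d & set(common)
        p = d - gamma - g
        omega = frozenset(o for o, a in outputs_ass.items() if set(a) & p)
        tx.add((frozenset(g | gamma | omega), len(p) - len(omega)))
    # Rows 3 and 5 to 7: represent the non-transmitted diagnoses by their minimal size.
    rest = [d for d in local if d not in d_tx]
    if rest:
        tx.add((frozenset(), min(len(d) for d in rest)))
    return tx


# Example 4: agent A2 with input s2, output s1, and common component G.
tx_a2 = transmit(
    [{"C", "s2"}, {"B", "C"}, {"G", "C"}, {"D"}],
    m=2,
    inputs={"s2"},
    common={"G"},
    outputs_ass={"s1": {"B"}},
)
```

With the data of Example 4, `tx_a2` holds the four tuples listed there. Checking `rest` for emptiness stands in for the cardinality comparison in row 5, which is equivalent as long as the local diagnoses are distinct.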

5.3 Receive and Merge Diagnoses

The second step is to receive the transmitted sets, transform them into an appropriate form, and then calculate the minimal cardinality condensed diagnoses.

If a received diagnosis includes a signal $s$ that is an output from the receiving agent, then the receiver knows which components $s$ depends on, and $s$ is therefore replaced with the assumption $ass(s)$. After the replacement, the minimal local diagnoses and the received diagnoses can be merged to form a set of condensed diagnoses. If a condensed diagnosis includes several inputs and several of these inputs depend on the same component, then the cardinality of the condensed diagnosis will not be correct. Consider for example two signals $s_1$ and $s_2$ depending on component $A$. The cardinality of $\langle \{s_1, s_2\}, 0 \rangle$ is two while the cardinality of the corresponding global diagnosis $\{A\}$ is one. The condensed diagnosis should be $\langle \{s_1, s_2\}, -1 \rangle$, where the minus one is a compensation.

Figure 4: An example of a three agent system.

Algorithm 2 performs the steps described above. The algorithm transforms the received sets of tuples by replacing inputs that are outputs from the current agent with the components in the outputs’ assumptions, rows 1–3. The function $MHS(M)$ calculates the minimal hitting sets for the collection of sets $M$. For example, $MHS(\{\{A, B\}, \{B, C\}\}) = \{\{A, C\}, \{B\}\}$. To be able to calculate the compensation discussed above, the set of inputs is partitioned into the set $\hat{\Gamma}^j$, which is the outputs that were added to the diagnosis when $A_j$ transmitted the tuple, and the set $\tilde{\Gamma}^j$, which is the inputs to agent $A_j$ without the outputs from $A_i$. In row 5 the received sets of tuples are merged. In row 6, the condensed diagnoses are compensated for signals depending on the same components. Finally, in row 7, the condensed diagnoses that do not have minimal cardinality are removed.
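The transformation in rows 1–3, where signals that are outputs of the receiving agent are replaced by minimal hitting sets of their assumptions, can be sketched as follows. This covers only that step, not the merging or the compensation. The names `receive` and `minimal_hitting_sets`, the dictionary `own_outputs_ass`, and the brute-force hitting-set search are assumptions made for this illustration.

```python
from itertools import combinations


def minimal_hitting_sets(collection):
    """Brute-force MHS: all subset-minimal sets that intersect every member set."""
    sets = [frozenset(s) for s in collection]
    universe = sorted(set().union(*sets)) if sets else []
    found = []
    for size in range(len(universe) + 1):
        for candidate in combinations(universe, size):
            cand = frozenset(candidate)
            if any(prev.issubset(cand) for prev in found):
                continue
            if all(cand & s for s in sets):
                found.append(cand)
    return set(found)


def receive(tx, own_outputs_ass):
    """Replace the receiver's own outputs in each received tuple by components.

    tx:              set of (frozenset, k) tuples received from another agent
    own_outputs_ass: mapping from each output of the receiving agent to its
                     assumption set
    """
    rx = set()
    for diagnosis, k in tx:
        own = diagnosis & set(own_outputs_ass)   # signals this agent produced
        rest = diagnosis - own                   # everything else is kept as is
        for hitting in minimal_hitting_sets([own_outputs_ass[s] for s in own]):
            rx.add((frozenset(rest | hitting), k))
    return rx


# Example 6: agent A2 receives TX from A1, where s1 is an output of A2.
rx_a1 = receive({(frozenset({"s1"}), 0), (frozenset(), 1)}, {"s1": {"B"}})
```

A tuple without any of the receiver's outputs passes through unchanged, because the only minimal hitting set of an empty collection is the empty set.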

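Algorithm 2 Calculate the minimal cardinality condensed diagnoses condense($A_i$).
Require: Received sets $TX^{A_j}$ from all agents $A_{j \neq i}$ as a result of evaluating transmit($A_j$, $m$). Set of minimal local diagnoses $\mathbb{D}^{A_i}$.
Ensure: Set of minimal cardinality condensed diagnoses $\mathbb{D}^{A_i}_s$.
1: for all $j \neq i$ do
2: $RX^{A_j} := \{\langle \bar{P} \cup G \cup \bar{G} \cup \hat{\Gamma} \cup \tilde{\Gamma}, k \rangle : \langle G \cup \Gamma \cup \Omega, k \rangle \in TX^{A_j},\ \hat{\Gamma} = \Omega,\ \bar{\Omega} = \Gamma \cap OUT^{A_i},\ \tilde{\Gamma} = \Gamma \setminus \bar{\Omega},\ H \in MHS(\cup_{\sigma \in \bar{\Omega}} ass(\sigma)),\ \bar{P} = H \cap \mathcal{P},\ \bar{G} = H \cap \mathcal{G}\}$
3: end for
4: $RX^{A_i} := \{\langle D, 0 \rangle : D \in \mathbb{D}^{A_i}\}$
5: $\mathbb{D}^{A_i}_s := \{\langle H, k \rangle : \langle D_j, k_j \rangle \in RX^{A_j},\ H = \cup_j D_j,\ k = \sum_j k_j\}$
6: $\mathbb{D}^{A_i}_s := \{\langle D, k + comp(D) \rangle : \langle D, k \rangle \in \mathbb{D}^{A_i}_s\}$
7: $\mathbb{D}^{A_i}_s := \{\langle D, k \rangle \in \mathbb{D}^{A_i}_s : k = \min_{\langle D, k \rangle \in \mathbb{D}^{A_i}_s} k\}$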

Algorithm 3 Function comp($D$).
Input: Diagnosis $D$.
Output: Variable $k$.
1: Each diagnosis is constructed such that $D = (P^i \cup G^i \cup \Gamma^i) \cup_{j \neq i} (\bar{P}^j \cup G^j \cup \bar{G}^j \cup \hat{\Gamma}^j \cup \tilde{\Gamma}^j)$
2: $Z^j := (\Gamma^i \cup_{k \neq i} \Gamma^k) \cap OUT^{A_j}$
3: $k^j_{21} := \min_{H \in MHS(\cup_{\sigma \in Z^j} ass(\sigma))} |H \cap \mathcal{P}| - |Z^j|$
4: $k^j_{22} := \min_{H \in MHS(\cup_{\sigma \in \Gamma^j \cap Z^j} ass(\sigma)) \cap \mathcal{P}} |H| - |\Gamma^j \cap Z^j|$
5: $k_2 := \sum_j (k^j_{21} + k^j_{22})$
6: $Z^G := G^i \cup_{j \neq i} (G^j \cup \bar{G}^j)$
7: $k_3 := \min_{H \in MHS(\cup_{\sigma \in \cup_{j \neq i} Z^j} ass(\sigma)) \cap \mathcal{G}} |H| - |Z^G \cap H|$
8: $k := k_2 + k_3$

The value of $k + comp(D)$ should be the difference between the cardinality of $D$ and that of a corresponding minimal cardinality global diagnosis. The variable $k$ is calculated in Algorithm 1, while the function $comp(D)$ is given by Algorithm 3. In the algorithm, $k_2$ is the compensation for signals depending on the same private components, while $k_3$ is the compensation for signals depending on the same common components. The following example will illustrate $comp(D)$.

Example 5: Consider an agent $A_2$ with outputs $s_1$ and $s_2$, private components $\mathcal{P}^{A_2} = \{A\}$, and assumptions $ass(s_1) = \{A\}$ and $ass(s_2) = \{A\}$.

Let a minimal local diagnosis in agent $A_1$ be $D = \{s_1, s_2\}$ and let $A_2$ have an empty set of minimal local diagnoses. The minimal cardinality condensed diagnosis in $A_2$, before evaluating $comp(D)$, is in this case $\langle \{s_1, s_2\}, 0 \rangle$. A minimal cardinality global diagnosis is $\{A\}$, and the cardinality of the minimal cardinality condensed diagnosis is therefore not correct, $|\{s_1, s_2\}| + 0 \neq |\{A\}|$. Using $comp(D)$, row 1 gives that $\Gamma^2 = \{s_1, s_2\}$. Row 3 gives that $k^2_{21} = |\{A\}| - |\{s_1, s_2\}| = -1$. The result is the minimal cardinality condensed diagnosis $\langle \{s_1, s_2\}, -1 \rangle$, which has correct cardinality.

Assume now that agent $A_2$ has the minimal local diagnosis $\{A\}$, and agent $A_1$ has an empty set of minimal local diagnoses. Agent $A_2$ transmits the set $\{\langle \{s_1, s_2\}, -1 \rangle\}$, which is the set of minimal cardinality condensed diagnoses in $A_1$. Using $comp(D)$, the set $\Gamma^2 = \{s_1, s_2\}$ and the result is that $k_2 = 0$. In the second example, the compensation was done in the transmitting agent, while in the first example, the compensation had to be done in the receiving agent. ⋄

How computationally difficult is the $comp(D)$ function? Consider the special case where $ass(\sigma) \subseteq \mathcal{P}^{A_i}$ and $ass(\sigma_k) \cap ass(\sigma_l) = \emptyset$, which means that a signal only depends on private components and that no two signals depend on the same component. In this simplified case $comp(D) = 0$. If $ass(\sigma) \subseteq \mathcal{P}$, i.e. a signal only depends on private components, then the variable $k_3 = 0$ while $k_2$ might be some non-zero value. The more connections that exist between the signals’ dependencies, and the more common components that exist, the more computationally complex $comp(D)$ will be.

The following lemma shows how the minimal cardinality condensed diagnoses could be calculated.

Lemma 1 Let $\mathbb{D}^{mc}$ be the set of minimal cardinality global diagnoses, and let transmit($A_j$, $\infty$) be evaluated for all agents $A_j \in \mathcal{A}$. Let $\mathbb{D}^{A_i}_s$ be the result after evaluating Algorithm 2 in agent $A_i$. Then $\mathbb{D}^{A_i}_s$ is the set of minimal cardinality condensed diagnoses.

Proof The complete proof is given in [Biteus et al., 2006]. Outline of the proof: A diagnosis $\langle D_e, k \rangle \in \mathbb{D}^{A_i}_s$ is by construction a condensed diagnosis if $|D_e| + k = |D_g|$, where $e$ refers to condensed and $g$ to global. First show the cardinality of $D_g \in \mathbb{D}^{mc}$, second show the cardinality of $D_e$ in $\langle D_e, k \rangle \in \mathbb{D}^{A_i}_s$, and finally show that the cardinalities are equal. It is found that $|D_e| + X = |D_g|$ where $X = \sum_{j \neq i} (k^j_1 + k^j_2) + k_3$, and $k^j_1$ is identified as $k_1$ in Algorithm 1 in agent $A_j$. The variables $k^j_2$ and $k_3$ are identified as the corresponding variables in Algorithm 3.

Algorithm 4 Main algorithm.
Require: Set of minimal local diagnoses $\mathbb{D}^A$ in all agents.
Result: Set of minimal cardinality condensed diagnoses.
1: Decide with voting: $m_1 = \max_{A \in \mathcal{A}} \min_{D \in \mathbb{D}^A} |D|$
2: $\forall A \in \mathcal{A}$ evaluate transmit($A$, $m_1$)
3: $\forall A \in \mathcal{A}$ evaluate condense($A$)
4: Decide with voting: $m_2 = \max_{A \in \mathcal{A}} \min_{D \in \mathbb{D}^A_s} |D|$
5: $\forall A \in \mathcal{A}$ evaluate transmit($A$, $m_2$)
6: $\forall A \in \mathcal{A}$ evaluate condense($A$)

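Example 6: Continuation of Example 4. Assume that agent $A_2$ has received the following sets of tuples from agents $A_1$ and $A_3$,

$$TX^{A_1} = \{\langle \{s_1\}, 0 \rangle, \langle \emptyset, 1 \rangle\} \qquad TX^{A_3} = \{\langle \{s_2\}, 0 \rangle\}.$$

Using Algorithm 2, the sets

$$RX^{A_1} = \{\langle \{B\}, 0 \rangle, \langle \emptyset, 1 \rangle\} \qquad RX^{A_3} = \{\langle \{s_2\}, 0 \rangle\}$$

are calculated. Assume that agent $A_2$ has the following set of minimal local diagnoses

$$\mathbb{D}^{A_2} = \{\{C, s_2\}, \{B, C\}, \{G, C\}, \{D\}\}$$

which is transformed with the algorithm to the set

$$RX^{A_2} = \{\langle \{C, s_2\}, 0 \rangle, \langle \{B, C\}, 0 \rangle, \langle \{G, C\}, 0 \rangle, \langle \{D\}, 0 \rangle\}.$$

The $RX$ sets are merged, resulting in the set

$$\mathbb{D}^{A_1}_s = \{\langle \{B, C, s_2\}, 0 \rangle, \langle \{s_2, C\}, 1 \rangle, \langle \{s_2, D\}, 1 \rangle\},$$

which is the set of minimal cardinality condensed diagnoses. ⋄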

5.4 Main Algorithm

Lemma 1 shows that the sets of minimal cardinality condensed diagnoses can be calculated with Algorithms 1 to 3. However, it is possible to reduce the computational burden by using the cardinality limit $m$ in Algorithm 1.

Algorithm 4 first calculates a lower bound $m_1$ on the cardinality of the minimal cardinality global diagnoses. The minimal cardinality condensed diagnoses are then calculated with this lower bound as input to transmit($\cdot$). The algorithm then computes an upper bound $m_2$ on the cardinality of the global diagnoses with minimal cardinality. Since $m_2$ is the cardinality of a global diagnosis, it is known that the cardinality of a minimal cardinality global diagnosis is less than or equal to $m_2$. The local diagnoses with cardinality greater than $m_2$ can therefore not be part of a minimal cardinality global diagnosis and can be ignored. The result is described in Theorem 1. The result after evaluating Algorithm 4 is that all agents have a set of minimal cardinality condensed diagnoses $\mathbb{D}^A_s$.

Figure 5: A system including relatively few signals compared to the number of components.

The reason for using Algorithm 4 is that it is, in most cases, more efficient than using the algorithm as described in Lemma 1.

Even though Algorithm 4 is written as two separate parts, the result of part one should in an implementation be used when calculating part two. The correctness of the algorithm is shown in Theorem 1.
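The voting step in row 1 of Algorithm 4 amounts to a max-min over the agents' local diagnoses, which the following sketch computes centrally. The function name `vote_limit` and the dictionary of per-agent diagnosis sets are assumptions made for this illustration; in the distributed setting each agent would contribute only its own minimum, and every agent is assumed here to have at least one local diagnosis.

```python
def vote_limit(local_diagnoses_per_agent):
    """Return the largest, over all agents, of the smallest local diagnosis size."""
    return max(
        min(len(d) for d in diagnoses)
        for diagnoses in local_diagnoses_per_agent.values()
    )


# Local diagnoses of the three agents in Example 7.
m1 = vote_limit({
    "A1": [{"A"}, {"G"}, {"H", "I"}],
    "A2": [{"C", "D"}],
    "A3": [{"F"}, {"J"}],
})
```

For these sets the per-agent minima are 1, 2, and 1, so `m1` is 2, which is indeed below the cardinality 4 of the minimal cardinality global diagnoses listed in Example 7.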

Theorem 1 Same assumptions as in Lemma 1, but let $\mathbb{D}^{A_i}_s$ be the result after evaluating Algorithm 4 in agent $A_i$. Then the set $\mathbb{D}^{A_i}_s$ is the set of minimal cardinality condensed diagnoses.

Proof Follows from Lemma 1.

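6 Example using the Algorithms

Two examples will be studied in this section.

Example 7: Consider the system shown in Figure 5. It includes three agents with the sets of private components $\mathcal{P}^{A_1} = \{A, H, I\}$, $\mathcal{P}^{A_2} = \{B, C, D\}$, $\mathcal{P}^{A_3} = \{E, F, J\}$, the set of common components $\mathcal{G} = \{G\}$, and the set of signals $\{s\}$.

The following sets of conflicts have been detected: $\Pi^{A_1} = \{\{A, H, G\}, \{A, I, G\}\}$, $\Pi^{A_2} = \{\{C\}, \{D\}\}$, and $\Pi^{A_3} = \{\{F, J\}\}$. The sets of minimal local diagnoses are calculated from the conflicts, resulting in the sets

$$\mathbb{D}^{A_1} = \{\{A\}, \{G\}, \{H, I\}\} \qquad \mathbb{D}^{A_2} = \{\{C, D\}\} \qquad \mathbb{D}^{A_3} = \{\{F\}, \{J\}\}.$$

The following sets of tuples are transmitted to agent $A_1$ from $A_2$ and $A_3$:

$$TX^{A_2} = \{\langle \emptyset, 2 \rangle\} \qquad TX^{A_3} = \{\langle \emptyset, 1 \rangle\}.$$

The received sets in agent $A_1$ are

$$RX^{A_1} = \{\langle \{H, I\}, 0 \rangle, \langle \{A\}, 0 \rangle, \langle \{G\}, 0 \rangle\}$$

$$RX^{A_2} = \{\langle \emptyset, 2 \rangle\} \qquad RX^{A_3} = \{\langle \emptyset, 1 \rangle\}.$$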

Figure 6: A system including relatively many signals compared to the number of components.

In agent $A_1$, the resulting set of minimal cardinality condensed diagnoses is $\mathbb{D}^{A_1}_s = \{\langle \{A\}, 3 \rangle, \langle \{G\}, 3 \rangle\}$. To verify this set, the set of minimal cardinality global diagnoses is calculated:

$$\mathbb{D}^{mc} = \{\{A, C, D, F\}, \{A, C, D, J\}, \{G, C, D, F\}, \{G, C, D, J\}\}.$$

Are the minimal cardinality condensed diagnoses correct? Consider the condensed diagnosis $\langle D, k \rangle = \langle \{A\}, 3 \rangle$ and the minimal cardinality global diagnosis $\bar{D} = \{A, C, D, F\}$. Using Definition 4 to verify that $\langle D, k \rangle$ is a condensed diagnosis: $|D| + 3 = |\bar{D}|$, $D \cap \mathcal{P} = \bar{D} \cap \mathcal{P}^{A_1} = \{A\}$, and $D \cap \mathcal{G} = \bar{D} \cap \mathcal{G} = \emptyset$. Further,

$$D \cap IN = \{ i : dep(i) \cap \bar{D} \setminus D \neq \emptyset,\ i \in IN \} = \emptyset.$$

This shows that $\langle D, k \rangle$ is a condensed diagnosis, and since $\bar{D}$ is a minimal cardinality global diagnosis, $\langle D, k \rangle$ is a minimal cardinality condensed diagnosis.

The condensed diagnosis $\langle \{A\}, 3 \rangle$ represents the first and the second minimal cardinality global diagnoses, while the condensed diagnosis $\langle \{G\}, 3 \rangle$ represents the other two. ⋄

In the example above, the minimal cardinality condensed diagnoses were a condensed and efficient representation of the minimal cardinality global diagnoses. The next example illustrates a case where the minimal cardinality condensed diagnoses are not a condensed and efficient representation.

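Example 8: Consider the system shown in Figure 6. The system includes three agents with the sets of private components $\mathcal{P}^{A_1} = \{A\}$, $\mathcal{P}^{A_2} = \{B, C, D\}$, $\mathcal{P}^{A_3} = \{E, F\}$, the set of common components $\mathcal{G} = \{G\}$, and the set of signals $\{s_1, s_2, s_3\}$. The assumptions are $ass(s_1) = \{B, C\}$, $ass(s_2) = \{C, D\}$, $ass(s_3) = \{E\}$.

The following sets of conflicts have been detected: $\Pi^{A_1} = \{\{s_1, A, G\}, \{s_3, A, G\}\}$, $\Pi^{A_2} = \{\{s_3, D\}\}$, and $\Pi^{A_3} = \{\{E, F\}\}$. The minimal local diagnoses are calculated from the conflicts, resulting in the sets

$$\mathbb{D}^{A_1} = \{\{s_1, s_3\}, \{A\}, \{G\}\} \qquad \mathbb{D}^{A_2} = \{\{s_3\}, \{D\}\} \qquad \mathbb{D}^{A_3} = \{\{E\}, \{F\}\}.$$

The following sets of tuples are transmitted to agent $A_1$ from $A_2$ and $A_3$:

$$TX^{A_2} = \{\langle \{s_3\}, 0 \rangle, \langle \{s_2\}, 0 \rangle\} \qquad TX^{A_3} = \{\langle \{s_3\}, 0 \rangle, \langle \emptyset, 1 \rangle\}.$$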


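The received sets in agent $A_1$ are

$$RX^{A_1} = \{\langle \{s_1, s_3\}, 0 \rangle, \langle \{A\}, 0 \rangle, \langle \{G\}, 0 \rangle\}$$

$$RX^{A_2} = \{\langle \{s_3\}, 0 \rangle, \langle \{s_2\}, 0 \rangle\}$$

$$RX^{A_3} = \{\langle \{s_3\}, 0 \rangle, \langle \emptyset, 1 \rangle\}.$$

The resulting set of minimal cardinality condensed diagnoses is

$$\mathbb{D}^{A_1}_s = \{\langle \{s_1, s_3\}, 0 \rangle, \langle \{s_1, s_2, s_3\}, -1 \rangle, \langle \{A, s_3\}, 0 \rangle, \langle \{G, s_3\}, 0 \rangle\},$$

where the $-1$ in the second condensed diagnosis means that the true cardinality is one less than the cardinality of the set $\{s_1, s_2, s_3\}$.

To be able to verify the correctness of the minimal cardinality condensed diagnoses, the minimal cardinality global diagnoses are calculated. The set of conflicts, not including signals, is $\Pi = \{\{B, C, A, G\}, \{E, A, G\}, \{E, D\}, \{E, F\}\}$ and the set of minimal cardinality global diagnoses is

$$\mathbb{D}^{mc} = \{\{B, E\}, \{A, E\}, \{C, E\}, \{G, E\}\}.$$

Are the minimal cardinality condensed diagnoses correct? Consider the condensed diagnosis $\langle D, k \rangle = \langle \{s_1, s_3\}, 0 \rangle$ and the minimal cardinality global diagnosis $\bar{D} = \{B, E\}$. Using Definition 4 to verify that $\langle D, k \rangle$ is a condensed diagnosis: $|D| + 0 = |\bar{D}|$, $D \cap \mathcal{P} = \bar{D} \cap \mathcal{P}^{A_1} = \emptyset$, and $D \cap \mathcal{G} = \bar{D} \cap \mathcal{G} = \emptyset$. Further,

$$D \cap IN = \{ i : dep(i) \cap \bar{D} \setminus D \neq \emptyset,\ i \in IN \} = \{s_1, s_3\}$$

since $ass(s_1) \cap \bar{D} \neq \emptyset$ and $ass(s_3) \cap \bar{D} \neq \emptyset$. This shows that $\langle D, k \rangle$ is a condensed diagnosis, and since $\bar{D}$ is a minimal cardinality global diagnosis, $\langle D, k \rangle$ is a minimal cardinality condensed diagnosis. ⋄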

As can be seen in the above example, quite a few calculations had to be performed compared to the calculation of the minimal cardinality global diagnoses. If a system has many components that are used by several agents, the minimal cardinality condensed diagnoses will include relatively many components. The reduction in the size and number of diagnoses will in this case be limited, and the efficiency of the algorithm reduced, as was seen in the example.

In automotive systems, the ECUs typically have a large number of private components compared to both the number of inputs and the number of common components. The algorithm is therefore well suited to these systems.

7 Conclusions

The objective when designing the algorithm described in Section 5 was to obtain a diagnostic algorithm that requires little processing power, memory, and network load, and that results in a transparent, scalable, and interoperable distributed system, see Section 2.

An algorithm has been presented in Section 5 that synchronizes the minimal local diagnoses in a distributed system. The result is a set of minimal cardinality condensed diagnoses, where each minimal cardinality condensed diagnosis is a subset of some minimal cardinality global diagnosis, see Theorem 1 and Lemma 1. The minimal cardinality condensed diagnoses only include components that are used by the specific agent and are therefore a local condensed representation of the minimal cardinality global diagnoses.

A diagnostic system using the algorithm presented in this paper is transparent, since the loss of an agent would only mean that this agent would not transmit its minimal local diagnoses on the network. It is scalable since new ECUs can be added directly to the network. The algorithm requires a low processing load and memory usage since unwanted private components have been removed from the condensed diagnoses with minimal cardinality.

References

[Biteus et al., 2005] J. Biteus, M. Jensen, and M. Nyberg. Distributed diagnosis for embedded systems in automotive vehicles. In Proceedings of IFAC World Congress’06, Prague, Czech Republic, 2005.

[Biteus et al., 2006] Jonas Biteus, Erik Frisk, and Mattias Nyberg. Condensed representation of global diagnoses with minimal cardinality in local diagnoses – extended version. Technical report, Dept. of Electrical Engineering, Linköpings Universitet, Sweden, 2006. To be published.

[de Kleer et al., 1992] J. de Kleer, A. K. Mackworth, and R. Reiter. Characterizing diagnoses and systems. Artificial Intelligence, 56, 1992.

[de Kleer, 1990] J. de Kleer. Using crude probability estimates to guide diagnosis. Artificial Intelligence, 45:197–222, 1990.

[de Kleer, 1991] J. de Kleer. Focusing on probable diagnoses. In Proceedings of 9th National Conf. on Artificial Intelligence, Anaheim, U.S.A., 1991.

[Hayes, 1999] C. C. Hayes. Agents in a nutshell - a very brief introduction. Knowledge and Data Engineering, IEEE Transactions on, 11(1):127–132, Jan/Feb 1999.

[Köppen-Seliger et al., 2003] B. Köppen-Seliger, T. Marcu, M. Capobianco, S. Gentil, M. Albert, and S. Latzel. MAGIC: An integrated approach for diagnostic data management and operator support. In Proceedings of IFAC Safeprocess’03, Washington, U.S.A., 2003.

[Provan, 2002] G. Provan. A model-based diagnosis framework for distributed systems. In 13th International Workshop on Principles of Diagnosis, Semmering, Austria, May 2002.

[Reiter, 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57–95, Apr 1987.

[Roos et al., 2002] N. Roos, A. ten Teije, A. Bos, and C. Witteveen. An analysis of multi-agent diagnosis. In Proceedings of the Conference on Autonomous Agents and Multi-Agent Systems, Bologna, Italy, Jul 2002.

[Roos et al., 2003] N. Roos, A. ten Teije, and C. Witteveen. A protocol for multi-agent diagnosis with spatially distributed knowledge. In Proceedings of 2nd Conference on Autonomous Agents and Multi-Agent Systems, Australia, Jul 2003.

[Tanenbaum and van Steen, 2002] Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems. Prentice Hall, 2002.

[Weiss, 1999] G. Weiss, editor. Multiagent systems: a modern approach to distributed artificial intelligence. MIT Press, Cambridge, Mass., U.S.A., 1999.

[Wotawa, 2001] F. Wotawa. A variant of Reiter’s hitting-set algorithm. Information Processing Letters, 79, 2001.


Focusing fault localization in model-based diagnosis with case-based reasoning

Aníbal Bregón†, Belarmino Pulido†, M. Aránzazu Simón†, Isaac Moro†, Oscar Prieto†, Juan J. Rodríguez Diez‡, Carlos Alonso-González†

†Intelligent Systems Group (GSI), Department of Computer Science, E.T.S.I. Informática, University of Valladolid, Valladolid, Spain
anibal,belar,arancha,isaac,oscapri,[email protected]

‡Department of Civil Engineering, University of Burgos, Burgos, Spain
[email protected]

Abstract

Consistency-based diagnosis automatically provides fault detection and localization capabilities, using just models for correct behavior. However, it may exhibit a lack of discrimination power. Knowledge about fault modes can be added to tackle the problem. Unfortunately, it brings additional complexity issues, since it will be necessary to discriminate among a maximum of $K^M$ mode assignments, for $M$ components and $K$ possible fault modes per component.

Usually, some kind of heuristic information is included in the diagnosis process to focus the model-based diagnostician. In this work we study the combination of a consistency-based diagnosis system together with a Case-based Reasoning system. The consistency-based diagnosis will perform fault detection and localization. The CBR system provides an accurate indication of the most probable fault mode at early stages of the localization process.

Keywords: Fault Diagnosis, Model-based Diagnosis, Case-Based Reasoning

1 Introduction

Consistency-based diagnosis, also known as the DX approach, is the most used approach to model-based diagnosis in the Artificial Intelligence community. One of its main advantages is that it requires only correct-behavior models to perform fault detection and localization [Hamscher et al., 1992]. However, this feature may lead to the “every component involved in the conflict” syndrome [Dressler, 1996]: we just know one component has failed, but there is almost no discriminative power. This is especially true when diagnosing dynamic systems with scarce observability. Moreover, pure consistency-based diagnosis can lead to logically sound, but physically impossible diagnoses [Dressler and Struss, 1996].

Usually, knowledge about fault modes is introduced to overcome this drawback. This can be done in several different ways. For instance, we could rely upon an abductive approach to model-based diagnosis [Poole, 1989]. In such a case, knowledge about faults is used to explain observations. However, that approach can provide inconsistent results from a logical point of view.

Since we want to retain logical soundness in our results, we have explored only those approaches within the pure consistency-based approach: fault modes will be rejected if the faulty behavior estimation is not consistent with observations. Only consistent fault modes will remain. Even within that approach, knowledge about fault modes can be introduced in different, and complementary, ways:

• Non-predictive approaches, which have little knowledge about fault modes. Two examples are physical impossibility [Friedrich et al., 1990], which just describes impossible physical behavior, and non-intermittency [Raiman et al., 1991], which takes advantage of information about non-intermittent faults.

• Predictive approaches, which use models for fault modes to estimate faulty behavior, as in Sherlock [de Kleer and Williams, 1989] or GDE+ [Struss and Dressler, 1989]. Based on such estimation, non-consistent fault modes are rejected. In such approaches, one fault mode can only be confirmed if all other fault modes have been rejected and there is no unknown fault mode. This is the selected approach.

Nevertheless, the increase in the discriminative power has a price. Diagnosis must discriminate among $2^N$ behavioral mode assignments when just correct, $ok(\cdot)$, and incorrect, $\neg ok(\cdot)$, modes are present for $N$ components. When $M$ behavioral models are allowed, diagnosis must discriminate among $M^N$ mode assignments. This is the problem faced by any model-based diagnosis proposal which attempts fault identification [Dressler and Struss, 1996].

Since the pure approach is infeasible in real systems for practical reasons, many approaches have been proposed in recent years to deal with the complexity issue. However, to the best of our knowledge, there is no general architecture suitable for any kind of system. In fact, many approaches just perform fault detection and localization, or rely upon a combination with some kind of heuristic, which helps to focus the diagnosis task. This will also be our approach.
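To give a sense of scale for this growth, the following worked arithmetic uses illustrative values that are not taken from the paper: $N = 10$ components and $M = 4$ behavioral modes per component.

```latex
\begin{align*}
2^{N} &= 2^{10} = 1024
  && \text{assignments with only } ok(\cdot) \text{ and } \neg ok(\cdot),\\
M^{N} &= 4^{10} = 1\,048\,576
  && \text{assignments with four modes per component},\\
\frac{M^{N}}{2^{N}} &= 2^{10} = 1024
  && \text{growth factor from adding the extra modes}.
\end{align*}
```

Even for this small hypothetical system, adding two further modes per component multiplies the number of candidate assignments by roughly a thousand.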

In the recent past, we first proposed a consistency-based diagnosis architecture which combined model-based and expert reasoning for diagnosis of industrial continuous systems [Pulido et al., 2001]. In that proposal, we used expert knowledge to derive a classification tree containing temporal information [Pulido, 2001]. Later on, we were able to deduce a set of rule-based classifiers through machine-learning techniques [Pulido et al., 2005]. In this article we present preliminary work on introducing case-based reasoning within the proposed architecture.

The organization of this paper is as follows. First, we introduce the compilation technique used to perform consistency-based diagnosis, which is the basis for our model-based diagnosis system. Then, we provide a brief summary of case-based reasoning and introduce the developed case-based reasoning system. Afterwards, we explain how the CBR system has been included in our diagnosis architecture, and show some results on a case study plant. Finally, we discuss the results and draw some conclusions.

2 Consistency-based diagnosis using possible conflict

Model-based diagnosis has been traditionally associated with just the Control Theory approach, usually known as the FDI approach. Recently, there has been a strong on-going research effort to bring a common framework for both the DX and FDI approaches. It has been defined as the BRIDGE framework [Biswas et al., 2004]. In this context, the work by Cordier et al. [Cordier et al., 2004] has established a common theoretical ground where proposals from the FDI and DX communities can be compared, clearly stating underlying hypotheses. Our proposal fits in this common framework.

The computation of possible conflicts is a compilation technique which, under certain assumptions, is equivalent to on-line conflict calculation in GDE, and to off-line generation of ARRs in the Control Theory approach to diagnosis [Pulido and Alonso, 2000; Pulido and Alonso Gonzalez, 2004]. We include a brief summary of this approach for the sake of self-containment.

The main idea behind the possible conflict concept is thatthe set of subsystems capable to generate a conflict1 can beidentified off-line. This identification can be done in threesteps.

The first one generates an abstract representation of thesystem, as an hypergraph. In this representation there is onlyinformation about constraints in the models, and their rela-tionship to known and unknown variables in such models.

The second step looks for minimal over-constrained setsof relations, which are essential for model-based diagnosis.These subsystems, called minimal evaluation chains, repre-sent a necessary condition for a conflict to exist. Each min-imal evaluation chain, which is a partial sub-hypergraph ofthe original system description, need to be solved using localpropagation criteria alone.

In the third step, extra knowledge is added to fulfill that requirement. We specify each possible way a constraint can be solved by means of local propagation. As a consequence, each minimal evaluation chain generates a directed and-or graph. In each and-or graph, a search for every possible way the system can be solved using local propagation is conducted. Each possible way is called a minimal evaluation model, and it can predict the behavior of a part of the whole system. Moreover, since conflicts will arise only when models are evaluated with available observations, the set of constraints in a minimal evaluation model is called a possible conflict.

¹ The possible conflict would represent, in FDI terms, an ARR which could be used for fault detection and isolation.

Those models can be used to perform fault detection. If there is a discrepancy between predictions from those models and current observations, the possible conflict would be responsible for such a discrepancy and should be confirmed as a real conflict. Afterwards, diagnosis candidates are obtained from conflicts following Reiter’s theory.²
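As an illustration of the last step, candidates in Reiter's theory are the minimal hitting sets of the confirmed conflicts. The following is a minimal brute-force sketch, not the authors' implementation; the function name `minimal_hitting_sets` and the two example conflicts are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def minimal_hitting_sets(conflicts):
    """Return every minimal set of components that intersects all conflicts.

    Brute-force enumeration by increasing size; adequate only for small systems.
    """
    components = sorted(set().union(*conflicts))
    found = []
    for size in range(1, len(components) + 1):
        for combo in combinations(components, size):
            candidate = set(combo)
            # Skip supersets of an already found (smaller) hitting set.
            if any(previous <= candidate for previous in found):
                continue
            if all(candidate & conflict for conflict in conflicts):
                found.append(candidate)
    return found


# Hypothetical confirmed conflicts, each given as a set of components.
conflicts = [{"T1", "P1", "P2"}, {"T1", "T2", "P1"}]
candidates = minimal_hitting_sets(conflicts)
```

With these two conflicts, the single-fault candidates are the components shared by both conflicts, and one double-fault candidate covers the remaining components.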

A detailed description of consistency-based diagnosis with possible conflicts can be found in [Pulido and Alonso, 2000; Pulido et al., 2001].

3 Case-Based Reasoning

3.1 Case-Based Reasoning fundamentals

Case-Based Reasoning (CBR) is a way to solve problems by remembering past similar situations and reusing the information and knowledge about these situations [Kolodner, 1993]. CBR uses the information stored in a case base to infer the solution for new problems.

CBR proposes a four-step cycle, which Aamodt and Plaza [Aamodt and Plaza, 1994] describe as: retrieve, reuse, revise, and retain.

The first task in the CBR cycle is to retrieve one or more similar cases from the case library where former experiences are stored. Hence, it is necessary to have a retrieval algorithm and a similarity measure that will be used to bring back a set of similar cases.

Aamodt and Plaza [Aamodt and Plaza, 1994] describe the reuse task focusing on two aspects: first, the differences between the past and the current case; second, what part of the retrieved case can be transferred to the new case. In some cases the reuse task consists of copying the past solution to the new case, but in other cases this solution cannot be directly applied and has to be adapted.

In the revision task, the solution for the new problem has to be tested.

Retention is the last task in the CBR cycle. In this step the new case and the solution for this case, obtained in the reuse stage, are stored in order to be used in the future.

3.2 CBR for diagnosis

CBR has been used for diagnosis tasks in different domains ([Lenz et al., 1998] and [Price, 1999] provide several examples). More precisely, we want to do diagnosis as classification of time series (as in [Colomer et al., 2003]). Moreover, since we rely upon a model-based diagnosis system for fault detection and localization, we are just concerned with an early diagnosis problem. In this case, early diagnosis of temporal data series means that we must do early fault identification, i.e. with incomplete data. When symptoms exhibit different dynamics, they will manifest at different times. Hence, faults can only be completely identified when the whole series are available. Consequently, with incomplete data, different faults can be consistent with current observations. In this context, CBR must provide a set of feasible faults, just after fault detection, according to available observations.

² In DX terminology, candidates are faulty components.

Since we want to test a CBR system just as a complementary tool to model-based diagnosis, we have decided to develop our own tool. As in any CBR system, we looked for a case structure which appropriately stores knowledge from past experiences and facilitates each step in the CBR cycle. In our system we must study temporal data series to find the actual fault. Therefore, our cases are made up of the set of available measurements in a period of time, and a label for the class of the fault. The source of those data is described in section 4.

System architecture

CBR is mainly intended for those domains where there is neither explicit knowledge nor a reasoning model, just past experiences and their related solutions. To test the approach we rely upon simulated faults. The architecture of the system can be seen in figure 1.

To perform an appropriate classification we have a configurable tool that simulates the different faults that could happen in the industrial plant and stores this information in a simulation experiment database. From each experiment we built a WEKA file [Witten and Frank, 2005]. WEKA is widely used in problem solving in AI. Using these WEKA files, and the class of the fault, a case is built.

Figure 1: Architecture of the CBR system.

Most CBR systems use a retrieval mechanism based on local distance, and most of them use the K-Nearest Neighbors algorithm. Applying this algorithm also requires some kind of similarity measure. In this work, we have chosen the K-Nearest Neighbors algorithm as the retrieval algorithm and three different similarity measures: Euclidean distance, Manhattan distance and Dynamic Time Warping (DTW).

DTW [Bellman, 1957; Keogh and Ratanamahatana, 2005] is a technique that yields a more robust dissimilarity measure between two sequences, possibly of different lengths, which are not exactly aligned on the time axis.

In the considered application the time series are multivariate, that is, we have multi-dimensional series. In order to use DTW with this kind of data, we apply DTW to each variable separately; the dissimilarity between two multivariate series is then calculated as the average of the dissimilarities for each variable.
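The per-variable averaging described above can be sketched as follows. This is a plain, unconstrained dynamic-programming DTW written for illustration; the function names `dtw` and `multivariate_dtw`, the absence of a warping window, and the default point metric are assumptions, not details taken from the described tool.

```python
def dtw(q, c, dist=lambda a, b: abs(a - b)):
    """Unconstrained DTW dissimilarity between two numeric sequences.

    Runs in O(len(q) * len(c)) time, i.e. quadratic for equal lengths.
    """
    inf = float("inf")
    n, m = len(q), len(c)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_previous = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(q[i - 1], c[j - 1]) + best_previous
    return cost[n][m]


def multivariate_dtw(series_a, series_b, dist=lambda a, b: abs(a - b)):
    """Average of the one-dimensional DTW values, one per variable."""
    per_variable = [dtw(a, b, dist) for a, b in zip(series_a, series_b)]
    return sum(per_variable) / len(per_variable)
```

For instance, `dtw([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated sample is absorbed by the warping, whereas a point-by-point distance is not even defined for sequences of different lengths.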

In order to work out the distance $d(q_i, c_j)$ between two points $q_i$ and $c_j$ from the series, we use three different kinds of metrics:

• Linear: $|q_i - c_j|$

• Quadratic: $(q_i - c_j)^2$

• Valley: $10 \times \left(1 - \exp\left(-\frac{(q_i - c_j)^2}{6}\right)\right)$
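The three point metrics can be written directly from their definitions. The sketch below is illustrative only; the function names are hypothetical.

```python
import math


def linear(q, c):
    """Absolute difference between two points."""
    return abs(q - c)


def quadratic(q, c):
    """Squared difference between two points."""
    return (q - c) ** 2


def valley(q, c):
    """Bounded metric: 0 for equal points, approaching 10 for large differences."""
    return 10.0 * (1.0 - math.exp(-((q - c) ** 2) / 6.0))
```

Unlike the linear and quadratic metrics, the valley metric saturates, so a single very large deviation between two points cannot dominate the accumulated dissimilarity.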

The reuse method used in this work is based on the K-nearest neighbors. We can choose the number of neighbors to be used and adapt the solution of the new case by voting among the solutions of the retrieved neighbors.
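Retrieval followed by voting can be sketched as below. This is a minimal illustration, assuming a case base stored as a list of `(series, label)` pairs and a user-supplied `distance` function; both names, and the tie-breaking behavior, are assumptions rather than details of the described system.

```python
from collections import Counter


def knn_classify(query, case_base, distance, k=3):
    """Return the majority fault label among the k cases nearest to `query`.

    Ties are resolved in favor of the label met first among the nearest
    cases (an assumption; the source does not state a tie-breaking rule).
    """
    ranked = sorted(case_base, key=lambda case: distance(query, case[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With `k=1` the reuse step reduces to copying the label of the single nearest case.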

In order to improve the classification, the system applies leaking and standardization algorithms to the data. Leaking does not allow the numerical series to exceed the allowed limits. Through standardization, all the data have mean equal to 0 and variance equal to 1. A detailed description of the CBR system can be found in [Bregon et al., 2006].
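A possible reading of these two preprocessing steps is sketched below. It assumes that leaking amounts to clipping each value to a known range and that standardization is applied to each series separately; neither detail is stated in the text, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def clip(series, low, high):
    """Limit every value to the interval [low, high] (assumed meaning of leaking)."""
    return [min(max(x, low), high) for x in series]


def standardize(series):
    """Rescale a series to mean 0 and variance 1 (population variance)."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        # A constant series carries no shape information; map it to zeros.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]
```

Standardizing before computing distances keeps measurements with large numeric ranges, such as levels, from dominating those with small ranges.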

4 Case study

The laboratory plant shown in figure 2 has features common to industrial continuous processes. It is made up of four tanks T1, . . . , T4, five pumps P1, . . . , P5, and two PID controllers acting on pumps P1, P5 to keep the level of T1, T4 close to the specified set point. To control the temperature in tanks T2, T3 we use two resistors R2, R3, respectively.

In this plant we have eleven different measurements: levels of tanks T1 and T4 (LT01, LT04), the value of the PID controllers on pumps P1, P5 (LC01, LC04), in-flow on tank T1 (FT01), outflow on tanks T2, T3, T4 (FT02, FT03, FT04), and temperatures on tanks T2, T3, T4 (TT02, TT03, TT04). Actions on pumps P2, P3, P4, and resistors R2, R3 are also known.

This plant can work in different situations. We have defined three working situations which are commanded through four different operation protocols. In the operation protocol used in this article resistor R3 is switched off, while resistor R2 is on. Also, pumps P3, P4 are switched off; hence, just flow FT01 is incoming to tank T1.

We have used equations that are common in simulation for this kind of process.

1. tdm: mass balance in tank t.

2. tdE : energy balance in tank t.

3. tfb: flow from tank t through a pump.

4. tf : flow from tank t through a pipe.

5. rp: resistor failure.

Based on these equations we have found the set of possible conflicts shown in table 1. In the table, the second column shows the set of constraints used in each possible conflict, which are minimal with respect to the set of constraints. The third column shows the involved components (support for an ARR in BRIDGE terminology, according to Cordier et al. [Cordier et al., 2004]). The fourth column indicates the estimated variable for each possible conflict.

Figure 2: Diagram of the laboratory plant.

| Possible conflict | Constraints | Components | Estimate |
|---|---|---|---|
| PC1 | t1dm, t1fb1, t1fb2 | T1, P1, P2 | LT01 |
| PC2 | t1fb1, t2dm, t2f | T1, T2, P1 | FT02 |
| PC3 | t1fb1, t2dm, r2p | T1, P1, T2, R2 | TT02 |
| PC4 | t1fb2, t3dm, t3f | T1, P2, T3 | FT03 |
| PC5 | t1fb2, t3dm | T1, P2, T3 | TT03 |
| PC6 | t4dm | T4 | LT04 |
| PC7 | t4fb | T4, P5 | FT04 |

Table 1: Possible conflicts found for the laboratory plant; constraints, components, and the estimated variable for each possible conflict.

For the current protocol, we have considered the set of fault modes shown in table 2.

We have also included noise in the measurements, but no sensor failure was studied.

Possible conflicts related to fault modes can be seen in the fault signature matrix shown in table 3.

It should be noticed that these are the fault mode classes which can be distinguished for fault identification. In the fault localization stage, the faults within each of the following groups cannot be separately isolated: {f1, f2, f4, f11}, {f3, f12}, and {f10, f13}.

| Class | Component | Description |
|---|---|---|
| f1 | T1 | Small leakage in tank T1 |
| f2 | T1 | Big leakage in tank T1 |
| f3 | T1 | Pipe blockage T1 (left outflow) |
| f4 | T1 | Pipe blockage T1 (right outflow) |
| f5 | T3 | Leakage in tank T3 |
| f6 | T3 | Pipe blockage T3 (right outflow) |
| f7 | T2 | Leakage in tank T2 |
| f8 | T2 | Pipe blockage T2 (left outflow) |
| f9 | T4 | Leakage in tank T4 |
| f10 | T4 | Pipe blockage T4 (right outflow) |
| f11 | P1 | Pump failure |
| f12 | P2 | Pump failure |
| f13 | P5 | Pump failure |
| f14 | R2 | Resistor failure in tank T2 |

Table 2: Fault modes considered.

4.1 Experimental design

The study was made on a data set made up of several examples obtained from simulations of the different classes of faults that could arise in the plant. We have modeled each fault class with a parameter in the [0, 1] range. We have made twenty simulations for each class of fault. Each simulation lasted 900 seconds. We randomly generate the fault magnitude, and the fault origin (the time at which the fault starts) in the interval [180, 300] seconds. We have also assumed that the system is in a stationary state before the fault appears.

The data were sampled once per second. However, due to the slow dynamics of the plant, we can select one sample every three seconds without losing discrimination capacity. Since we have just eleven measurements, each simulation provides eleven series of three hundred numeric elements. Moreover, by reducing the sampling rate we reduce the processing time for DTW (which is quadratic).

f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14
PC1 1 1 1 1 1 1
PC2 1 1 1 1
PC3 1 1 1 1
PC4 1 1 1 1
PC5 1 1 1
PC6 1
PC7 1 1

Table 3: PCs and their related fault modes.

Hence, each case in the case base was made up of a label for the single fault plus eleven data series, one for each available measurement.
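The case structure and the reduction from one sample per second to one sample every three seconds can be sketched as follows. The dictionary layout and the name `build_case` are hypothetical; only the label-plus-series content and the sampling step come from the text.

```python
def build_case(measurements, fault_label, step=3):
    """Build a case from raw 1 Hz measurements.

    measurements: dict mapping a measurement name (e.g. "LT01") to its samples.
    Keeps one sample out of every `step`, so 900 samples become 300.
    """
    series = {name: values[::step] for name, values in measurements.items()}
    return {"label": fault_label, "series": series}


# Hypothetical raw data for a single measurement of a 900-second simulation.
raw = {"LT01": [0.0] * 900}
case = build_case(raw, "f1")
```

Because DTW is quadratic in the series length, keeping one sample in three reduces the number of cell evaluations per comparison by roughly a factor of nine.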

4.2 Results

We conducted several experiments applying all the described techniques for retrieving and reusing.

We applied one-dimensional DTW using different metrics (linear, quadratic and valley), Euclidean distance and Manhattan distance. Table 4 shows the classification success achieved using one, three and five neighbors for the reusing task. The results have been obtained using stratified cross validation with 10 equally sized subsets from the dataset.
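Stratification means that every subset keeps the class proportions of the whole data set. A minimal sketch of such a split is given below; the round-robin assignment and the name `stratified_folds` are assumptions, since the text does not say how the subsets were drawn (in particular, no shuffling is done here).

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=10):
    """Split case indices into folds with the same class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    # Deal the cases of each class round-robin over the folds.
    for label in sorted(by_class):
        for index in by_class[label]:
            folds[position % n_folds].append(index)
            position += 1
    return folds


# Fourteen fault classes with twenty simulations each, as in the experiments.
labels = [f"f{k}" for k in range(1, 15) for _ in range(20)]
folds = stratified_folds(labels)
```

With twenty cases per class and ten folds, each fold holds two cases of every fault class, and each fold serves once as the test set while the other nine form the case base.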

Since we will use the CBR system for early classification, we only show results for 30, 40 and 50% of the data series length. As we stated before, we randomly generate the fault origin in the interval [180, 300] seconds, and each simulation lasted 900 seconds. Therefore, between 20% and 33% of the data will be available at detection time. In the initial stages of fault isolation, the CBR system will be used to focus the consistency-based diagnosis system as described below.
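The percentages quoted above follow directly from the simulation settings, under the simplifying assumption that the fault is detected as soon as it starts:

```latex
\[
\frac{180\ \text{s}}{900\ \text{s}} = 20\%,
\qquad
\frac{300\ \text{s}}{900\ \text{s}} \approx 33.3\%,
\qquad
0.4 \times 300\ \text{samples} = 120\ \text{samples per series at } 40\%.
\]
```

Any detection delay only increases the fraction of the series already observed when the CBR system is first invoked.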

5 Integration proposal

Consistency-based diagnosis automatically provides fault isolation based on fault detection results. Using possible conflicts, consistency-based diagnosis can be easily done without on-line dependency recording. The proposed diagnosis process will incrementally generate the set of candidates consistent with observations.

In the off-line stage, we initially analyze the system and find every possible conflict, $pc_i$. Then, we build an executable model, $SD_{pc_i}$, for each $pc_i$.

In the on-line stage, we perform a semi-closed loop simulation with each executable model $SD_{pc_i}$:

repeat

1. simulate ($SD_{pc_i}$, $OBS_{pc_i}$) → $PRED_{pc_i}$.

2. if $|PRED_{pc_i} - OBS'_{pc_i}| > \delta_{pc_i}$, confirm $pc_i$ as a real conflict.

3. update (set of candidates, set of activated pcs)

until every $pc_i$ is activated or time elapsed.

$OBS_{pc_i}$ denotes the set of input observations available for $SD_{pc_i}$; $PRED_{pc_i}$ represents the set of predictions obtained from $SD_{pc_i}$; $OBS'_{pc_i}$ denotes the set of output observations for $SD_{pc_i}$; and $\delta_{pc_i}$ is the maximum value allowed as the dissimilarity value between $OBS'_{pc_i}$ and $PRED_{pc_i}$.

Without further information about fault modes, consistency-based diagnosis will just provide a list of feasible faulty candidates. To improve the accuracy of our system, in previous works we have relied upon an induced classifier. Such a classifier provided, upon request, a ranking of the most feasible fault modes based on current and past data [Alonso et al., 2003; Pulido et al., 2005]. Following a similar approach, we want to keep the logical properties of consistency-based diagnosis. Therefore, we just use CBR as an indication of the most feasible fault, as an initial stage of fault identification.
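The on-line loop (steps 1 and 2) can be sketched in a few lines. Everything named here is hypothetical: `models` maps each possible conflict to a simulation function, `observe` returns the input and output observations for a time step, and the dissimilarity is taken as a mean absolute difference, which is an assumption since the text does not define it. Step 3, the candidate update, is left out.

```python
def online_detection(models, observe, thresholds, max_steps):
    """Return the possible conflicts confirmed as real conflicts.

    models: dict pc_name -> function(inputs) returning predicted outputs.
    observe: function(step) -> dict pc_name -> (inputs, outputs).
    thresholds: dict pc_name -> maximum allowed dissimilarity.
    """
    activated = set()
    for step in range(max_steps):
        observations = observe(step)
        for name, simulate in models.items():
            if name in activated:
                continue
            inputs, outputs = observations[name]
            predicted = simulate(inputs)
            gaps = [abs(p - o) for p, o in zip(predicted, outputs)]
            if sum(gaps) / len(gaps) > thresholds[name]:
                activated.add(name)
        if len(activated) == len(models):
            break
    return activated
```

Comparing an aggregated dissimilarity against a threshold, rather than each sample individually, makes the check tolerant to measurement noise.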

In this sense, the integration of the new CBR system is straightforward. Every time a conflict is detected, we incrementally compute the set of candidates consistent with current observations. Afterwards, we invoke the CBR classifier:

4. CLASSIFIER CBR (t0, set of candidates)

and the set of candidates is ranked according to the CBR output.

CLASSIFIER CBR(t, c) denotes an invocation of the CBR system with a fragment of the series from t to min(current time, t + maximum series length), and with the set of candidates c. t0 is the starting time of the series, prior to the first conflict confirmation. Once again, the consistency-based diagnosis system commands the diagnosis results, because we just consider CBR output which is consistent with consistency-based diagnosis results.
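One way to let consistency-based diagnosis command the result is to discard any CBR suggestion that is not a current candidate. The sketch below is illustrative; the function name `rank_candidates`, the list-based CBR output, and the placement of unranked candidates at the end are all assumptions.

```python
def rank_candidates(candidates, cbr_ranking):
    """Order the candidates by CBR preference.

    candidates: set of fault modes consistent with the confirmed conflicts.
    cbr_ranking: fault labels proposed by the CBR system, best first.
    """
    ranked = []
    for fault in cbr_ranking:
        # CBR output inconsistent with the candidate set is ignored.
        if fault in candidates and fault not in ranked:
            ranked.append(fault)
    # Candidates the CBR system did not mention are kept, after the ranked ones.
    ranked += sorted(fault for fault in candidates if fault not in ranked)
    return ranked
```

For example, with candidates `{"f1", "f2", "f11"}` and a CBR output of `["f5", "f2", "f1"]`, the suggestion `f5` is dropped because it is not consistent with the confirmed conflicts.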

6 Discussion and Conclusions

In this article we have introduced preliminary work which explores the integration of model-based diagnosis and case-based reasoning. The case-based tool could be used in a real complex system where no fault model is available, but a collection of past fault diagnosis episodes is. Currently, we are testing our approach in a laboratory plant, before we validate the final diagnosis system in a real continuous plant. This was the main reason for our CBR system to be made up of simulated diagnosis cases.

The proposed integration of consistency-based diagnosis with CBR was rather simple: once the set of candidates is updated, subsequent calls to the CBR system provide an order on the set of feasible fault modes. A similar integration procedure was followed in previous works, where we used other machine-learning techniques for inducing classifiers [Pulido et al., 2005]. Such classifiers were developed using the whole temporal data series, for the same set of faults. The procedure

CLASSIFIER CBR(t0, set of candidates)

does not need the whole series on-line to provide a reasonable classification success: 87.9% using just 40% of the series and quadratic DTW in the retrieval stage. This percentage increases to 91.1% for 50% of the series. Results with 100% of the data series are 91.4%.


| Retrieving distance | One neighbor, 30% | One neighbor, 40% | One neighbor, 50% | Three neighbors, 30% | Three neighbors, 40% | Three neighbors, 50% | Five neighbors, 30% | Five neighbors, 40% | Five neighbors, 50% |
|---|---|---|---|---|---|---|---|---|---|
| Linear DTW | 51.4% | 84.3% | 87.1% | 48.2% | 82.1% | 83.6% | 46.4% | 79.3% | 81.4% |
| Quadratic DTW | 56.1% | 87.9% | 91.1% | 57.9% | 84.3% | 87.1% | 53.2% | 83.2% | 83.6% |
| Valley DTW | 56.1% | 87.9% | 91.1% | 57.9% | 84.3% | 87.1% | 53.2% | 83.2% | 83.6% |
| Euclidean | 52.1% | 83.6% | 86.1% | 51.4% | 80.0% | 84.3% | 48.6% | 78.9% | 82.1% |
| Manhattan | 49.3% | 81.4% | 85.4% | 44.3% | 80.7% | 83.6% | 42.1% | 76.4% | 81.4% |

Table 4: Classification success using one-dimensional DTW, Euclidean distance and Manhattan distance with one, three and five neighbors for the reuse task.

Table 5 shows the classification success achieved using other classifiers for different percentages of the data series: decision trees and support vector machines (SVM, with linear kernel). CBR obtains better results than SVM, but induced rule-based classifiers for early diagnosis still provide better results with respect to error rate at 30, 40 and 50% of the time series. However, with those techniques it is still necessary to learn a classifier for each percentage of the series. On the other hand, a simple CBR system easily provides a preferred fault mode (or K preferred modes if we use the K-neighbors retrieval algorithm).

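| Technique | Classification success, 30% | Classification success, 40% | Classification success, 50% |
|---|---|---|---|
| Decision trees | 68.6% | 94.3% | 91.8% |
| SVM | 44.6% | 80.7% | 84.6% |

Table 5: Classification success using Decision trees and SVM.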

The main conclusion is that our CBR system is able to provide a valuable ranking of available fault models with 40 and 50% of the data that would be needed to completely identify a fault, assuming faults arise between 20 and 33% of the whole series. Moreover, we just need one classifier for different sizes of the set of data to be classified.

These two differences also apply with respect to other CBR systems, such as the work by Colomer et al. [Colomer et al., 2003], where temporal data are abstracted into episodes and the whole series is needed.

The main advantage of the simple interface proposed between the model-based diagnosis and CBR approaches is that we retain the “consistency” of our model-based diagnosis system. We use a semi-closed loop simulation of numerical models over a small period of time, in a noisy environment, with uncertainties in the models; we do not perform a crisp comparison between predicted and observed values, but we compare a dissimilarity value (a kind of global distance) against a fixed threshold. We empirically select these thresholds to minimize fault detections.

Concerning completeness, our results depend on the simulation models. We can provide fault detection and isolation capabilities based just on correct behavior models, due to our consistency-based diagnosis approach. For fault identification we use CBR to rank the most feasible fault modes; again, we prioritized consistency-based diagnosis results. However, we cannot guarantee the identification of 100% of the fault modes considered. Our best results are 8.6% error for the complete fault episode.

As future work, we plan to increase the size of the case base. Our initial guess is that a bigger size will improve the results of the K-Nearest Neighbors algorithm for more than one neighbor. Also, we will need to revise the linear structure of the case base. Our intention is to use the combined result from model-based diagnosis and the classifier as an initial clue for the fault identification stage using consistency-based diagnosis.

Acknowledgments: this work has been partially funded by the Spanish Ministry of Education and Culture, through grant DPI2005-08498, and Junta Castilla y Leon VA088A05.

References

[Aamodt and Plaza, 1994] A. Aamodt and E. Plaza. Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. AI Communications. IOS Press, Vol. 7: 1, pages 39–59, 1994.

[Alonso et al., 2003] C. Alonso, J.J. Rodriguez, and B. Pulido. Enhancing consistency-based diagnosis with machine-learning techniques. In 14th International Workshop on Principles of Diagnosis, DX03, Washington, D.C. USA, 2003.

[Bellman, 1957] R.E. Bellman. Dynamic Programming. Cambridge Studies in Speech Science and Communication. Princeton University Press, Princeton, 1957.

[Biswas et al., 2004] G. Biswas, M.O. Cordier, J. Lunze, L. Trave-Massuyes, and M. Staroswiecki. Diagnosis of complex systems: bridging the methodologies of the FDI and DX communities. IEEE Trans. on Systems, Man, and Cybernetics. Part B: Cybernetics, 34(5):2159–2162, 2004.

[Bregon et al., 2006] A. Bregon, M.A. Simon, J.J. Rodríguez, C. Alonso, B. Pulido, and I. Moro. Early fault classification in dynamic systems using case-based reasoning (accepted; to be published in Lecture Notes on Artificial Intelligence, draft available upon request). In Post-proceedings 11th Conference of the Spanish Association of Artificial Intelligence, CAEPIA’05, Santiago de Compostela, Spain, 2006.

[Colomer et al., 2003] J. Colomer, J. Melendez, and F. Gamero. A qualitative case-based approach for situation assessment in dynamic systems. Application in a two tank system. In Proceedings of the IFAC-Safeprocess 2003, Washington, USA, 2003.


[Cordier et al., 2004] M.O. Cordier, P. Dague, F. Levy, J. Montmain, M. Staroswiecki, and L. Trave-Massuyes. Conflicts versus analytical redundancy relations: a comparative analysis of the model-based diagnosis approach from the artificial intelligence and automatic control perspectives. IEEE Trans. on Systems, Man, and Cybernetics. Part B: Cybernetics, 34(5):2163–2177, 2004.

[de Kleer and Williams, 1989] J. de Kleer and B.C. Williams. Diagnosing with Behavioral Modes. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (IJCAI-89), Detroit, Michigan, USA, 1989.

[Dressler and Struss, 1996] O. Dressler and P. Struss. The consistency-based approach to automated diagnosis of devices. In Gerhard Brewka, editor, Principles of Knowledge Representation, pages 269–314. CSLI Publications, Stanford, 1996.

[Dressler, 1996] O. Dressler. On-line diagnosis and monitoring of dynamic systems based on qualitative models and dependency-recording diagnosis engines. In Proceedings of the Twelfth European Conference on Artificial Intelligence (ECAI-96), pages 461–465, 1996.

[Friedrich et al., 1990] G. Friedrich, G. Gottlob, and W. Nejdl. Physical impossibility instead of fault models. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 331–336, Boston, Massachusetts, USA, 1990.

[Hamscher et al., 1992] W. Hamscher, L. Console, and J. de Kleer (Eds.). Readings in Model-based Diagnosis. Morgan-Kaufmann Pub., San Mateo, 1992.

[Keogh and Ratanamahatana, 2005] E. Keogh and C. A. Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3):358–386, 2005.

[Kolodner, 1993] J. Kolodner. Case-Based Reasoning. Morgan Kaufmann Publishers, 1993.

[Lenz et al., 1998] M. Lenz, M. Manago, and E. Auriol. Diagnosis and decision support. In Case-Based Reasoning Technology, From Foundations to Applications, pages 51–90. Lecture Notes In Computer Science; Vol. 1400, 1998.

[Poole, 1989] D. Poole. Normality and faults in logic-based diagnosis. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (IJCAI-89), pages 1304–1310, Detroit, Michigan, USA, 1989.

[Price, 1999] C. Price. Computer-based diagnostic systems. Springer Verlag, New York, 1999.

[Pulido and Alonso Gonzalez, 2004] B. Pulido and C. Alonso Gonzalez. Possible conflicts: a compilation technique for consistency-based diagnosis. IEEE Trans. on Systems, Man, and Cybernetics. Part B: Cybernetics, 34(5):2192–2206, 2004.

[Pulido and Alonso, 2000] B. Pulido and C. Alonso. An alternative approach to dependency-recording engines in consistency-based diagnosis. In Artificial Intelligence: Methodology, Systems, and Applications. AIMSA-00, volume 1904 of LNAI, pages 111–120. Springer Verlag, Berlin, Germany, 2000.

[Pulido et al., 2001] B. Pulido, C. Alonso, and F. Acebes. Lessons learned from diagnosing dynamic systems using possible conflicts and quantitative models. In Engineering of Intelligent Systems. XIV Conf. IEA/AIE-2001, volume 2070 of LNAI, pages 135–144, Budapest, Hungary, 2001.

[Pulido et al., 2005] B. Pulido, J.J. Rodriguez Diez, C. Alonso Gonzalez, O. Prieto, E. Gelso, and F. Acebes. Diagnosis of continuous dynamic systems: integrating consistency-based diagnosis with machine-learning techniques. In XVI IFAC World Congress, 2005, Prague, Czech Republic, 2005.

[Pulido, 2001] B. Pulido. Possible conflicts as an alternative to on-line dependency-recording for diagnosing continuous systems (in Spanish). Ph.D., E.T.S.I. Informatica. Universidad de Valladolid, Valladolid, February 2001.

[Raiman et al., 1991] O. Raiman, J. de Kleer, V. Saraswat, and M. H. Shirley. Characterizing non-intermittent faults. In Proceedings of the Ninth National Conference on Artificial Intelligence, pages 849–854, Anaheim, California, USA, 1991.

[Struss and Dressler, 1989] P. Struss and O. Dressler. Physical negation: Introducing fault models into the general diagnostic engine. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (IJCAI-89), pages 1318–1323, Detroit, Michigan, USA, 1989.

[Witten and Frank, 2005] I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques (Second Edition). Morgan Kaufmann, 2005.


Ambiguity Groups Determination for Analog Non Linear Circuits Diagnosis

Barbara Cannas, Alessandra Fanni, and Augusto Montisci
Dipartimento di Ingegneria Elettrica ed Elettronica
University of Cagliari
Piazza d’Armi, 09123 - Cagliari, Italy
cannas; fanni; [email protected]

Abstract

In this paper, a symbolic procedure for ambiguity groups determination, based on the a priori identification concept, is proposed. The method starts from an analysis of the occurrence of parameters in the input/output relationship coefficients in order to select the potential canonical ambiguity groups. This first step allows one to strongly reduce the problem complexity. Then, the nonlinear system obtained by imposing the ambiguity conditions is solved, resorting to Gröbner bases theory. These procedures are completely symbolic, so they do not cause round-off errors. Furthermore, this approach overcomes the limits of the procedures presented in the literature, which are directly applicable only to linear circuits. In fact, the method can be applied both to linear circuits and to nonlinear circuits with algebraic nonlinearity. Examples of application concern a well-known linear circuit benchmark and a third-order chaotic circuit.

1 Introduction

In recent years, there has been a growing interest in automatic procedures for analog circuit fault diagnosis [Catelani et al., 1987][Stenbakken et al., 1989][Carmassi et al., 1991][Liberatore et al., 1994][Fedi et al., 1998][Fedi et al., 1999][Starzyk et al., 2000][Cannas et al., 2004]. To perform this task it is crucial to have a quantitative measure of circuit solvability, i.e., of solvability of the nonlinear fault diagnosis equations. The testability concept is strictly linked to the concept of circuit solvability. The most popular definition of analog circuit testability was introduced in [Sen and Saeks, 1979][Chen and Saeks, 1979]. This definition provides a quantitative measure of network testability, i.e., a measure of solvability of fault equations. Regardless of which method is actually used to write the fault equations, the testability of a circuit gives the maximum number of faults that a hypothetical diagnostic system could detect if they are simultaneously present in the circuit. Therefore, the results of the testability analysis fix an upper limit to the information that both an FDI and a DX [Cordier et al., 2004] approach can give regarding the fault state of a circuit. For low testability circuits, an important concept is that of ambiguity groups. Roughly speaking, an ambiguity group is a set of components that, if considered as potentially faulty, does not give a unique solution in the fault location phase. An ambiguity group that does not contain other ambiguity groups is called a Canonical Ambiguity Group (CAG) [Fedi et al., 1998]. One of the first algorithms for CAG determination was presented in [Stenbakken et al., 1989]. Several numerical algorithms and symbolic procedures for evaluating circuit testability and CAGs in the case of linear circuits have been developed by several authors [Catelani et al., 1987][Stenbakken et al., 1989][Starzyk et al., 2000]. However, few contributions are present in the literature regarding nonlinear circuit testability. In [Fedi et al., 1998b] a symbolic approach for testability evaluation of nonlinear analog circuits is presented. The approach is an extension of the methodologies developed for the linear case to circuits where nonlinear components, such as diodes or transistors, are present. In practice, the procedure for determining the testability value and the ambiguity groups resorts to the substitution of each nonlinear component with its piecewise linear model. It is worth pointing out that the network functions determined in the s domain do not have any meaning and cannot be used in the fault location phase in order to determine the effective value of the circuit components. They can give information only about the testability and ambiguity group determination.

In [Worsman and Wong, 2000] the ambiguity analyses in the dc and ac domains are conducted independently of one another and the results are combined. In the first case, the nonlinear components are replaced with PWL models whereas, in the ac case, a small signal analysis is performed.

This paper proposes a symbolic procedure for ambiguity groups determination, based on the a priori identification concept, that can be applied to both linear and nonlinear circuits. The method starts from an analysis of the occurrence of parameters in the input/output relationship coefficients in order to select the potential CAGs. This first step allows one to strongly reduce the problem complexity. Then, for each potential CAG, the nonlinear system obtained by imposing the ambiguity conditions is solved, resorting to Gröbner bases [Becker and Weispfenning, 1993].

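2 Theory

Let us consider an analog, time-invariant, nonlinear circuit described by:

$$
\begin{cases}
\dfrac{d\mathbf{x}}{dt} = \mathbf{f}[\mathbf{x}(t), u(t), \mathbf{F}] \\[4pt]
y = h[\mathbf{x}(t)]
\end{cases}
\qquad (1)
$$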

where x is the n-dimensional state variable vector, u is a scalar input, F = [F1, F2,…, FL] is the vector of the parameters, L the number of components, y a scalar output, and f and h are algebraic function vectors in x. We assume that only one scalar output is measured as y(t) (i.e., the state is not directly available).

The literature reports several definitions of CAGs for linear circuits [Fedi et al., 1998][Cannas et al., 2005].

By the first definition of CAG [Fedi et al., 1998], d components F1, F2,…, Fd belong to a CAG if a variation of d1<d components causes a variation of the transfer function that is indistinguishable from the one produced by the variation of the other d-d1 components of the group, i.e.,

$$\Delta H(\Delta F_1, \ldots, \Delta F_{d_1}) = \Delta H(\Delta F_{d_1+1}, \ldots, \Delta F_d) \qquad (2)$$

In this paper, an equivalent definition of CAG is adopted (see [Cannas et al., 2005] for the proof of the equivalence of the two definitions).

Definition: d components F1, F2,…, Fd belong to a CAG if a variation ∆F ≠ 0 exists that does not affect the transfer function value, i.e.,

$$H(F_1, \ldots, F_d) = H(F_1 + \Delta F_1, \ldots, F_d + \Delta F_d)$$

For example, two components belong to a CAG if, for every deviation from the nominal value in one parameter, there exists a deviation in the other, which perfectly balances it so as to have no effect on the transfer function.
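As a minimal worked illustration of this balancing (a hypothetical first-order transfer function chosen only for the example, not one of the circuits analysed in this paper), suppose the two parameters F1 and F2 enter only through their product:

$$H(s; F_1, F_2) = \frac{1}{1 + s\,F_1 F_2}$$

Any deviation F’1 = (1 + δ) F1 with δ > −1 is then balanced by F’2 = F2 / (1 + δ):

$$F'_1 F'_2 = (1+\delta)\,F_1 \cdot \frac{F_2}{1+\delta} = F_1 F_2 \;\Longrightarrow\; H(s; F'_1, F'_2) = H(s; F_1, F_2) \quad \forall s$$

so the pair {F1, F2} satisfies the definition above.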

For nonlinear circuit diagnosis, a definition of CAG is needed that does not rely on the transfer function, which is defined only for linear systems. The most natural way is to refer to ‘the Input/Output Relationship’ (IOR).

Definition 1: d components F1, F2,…, Fd constitute a CAG iff a variation ∆F ≠ 0 exists such that the IOR does not globally change.

Generally speaking, if d components constitute a CAG, the corresponding system is not identifiable. In fact, system (1) is identifiable through the IOR if:

IOR(F1, F2,…, Fk) = IOR(F’1, F’2,…, F’k) iff (F1, F2,…, Fk) = (F’1, F’2,…, F’k)

for a finite number of sets F’1, F’2,…, F’k, i.e., if the components F1, F2,…, Fk do not belong to a CAG.

For a system in the form (1), the IOR is a nonlinear differential polynomial in u, y and their derivatives and can be expressed as:

$$z(u, y) = z(y, \dot{y}, \ddot{y}, \ldots, u, \dot{u}, \ddot{u}, \ldots, a_1, a_2, \ldots) = 0$$

where the ai are the input/output relationship coefficients, which depend on the parameter vector F.

The IOR can be determined by resorting to the concept of ‘Characteristic Set’ associated with the dynamic state equations. The Characteristic Set was introduced by Ritt [Ritt, 1950], and since 1990 it has been widely used for the study of dynamic systems [Ljung and Glad, 1994]. Its formal definition requires some concepts of Differential Algebra [Fliess and Glad, 1993]. The peculiarity of the Characteristic Set is that it summarizes all the information contained in the differential equations defining a dynamic system. If one chooses the ranking of the variables and their derivatives:

$$u < \dot{u} < \ddot{u} < \ldots < y < \dot{y} < \ldots < x_1 < x_2 < \ldots < \dot{x}_1 < \dot{x}_2 < \ldots$$

for a system of the form (1), the Characteristic Set exhibits n+1 differential polynomials, that is:

• the IOR, denoted by z(u, y);

• n differential polynomials in y, x and u, denoted by the n-dimensional vector Z(u, x, y).

The procedure to identify the IOR may turn out to be rather complex, so a software tool is generally needed to compute the Characteristic Set. However, in some cases, the IOR can be obtained by simple inspection and manipulation of the dynamic equations.

In this paper, the commercial tool REDUCE™ has been used, together with the implementation of Ritt’s algorithm described in [Audoly et al., 2001].
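A small example of the "simple inspection" case mentioned above, sketched for a hypothetical first-order system with two parameters F1 and F2 (it is an assumption made for illustration, not one of the benchmarks considered later):

$$\frac{dx}{dt} = -(F_1 + F_2)\,x + u, \qquad y = x$$

Since the output coincides with the state, substituting x = y gives the IOR directly, with no need for a Characteristic Set computation:

$$z(u, y) = \dot{y} + a_1 y - u = 0, \qquad a_1 = F_1 + F_2$$

The only coefficient depends on F1 and F2 through their sum, so any variation with ∆F1 = −∆F2 leaves the IOR unchanged.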

3 Determination of Canonical Ambiguity Groups

A preliminary analysis of the IOR coefficients allows one to exclude a priori that a group of components constitutes a CAG. For example, two components F1 and F2 cannot constitute a CAG if there is an IOR coefficient in which only one of them (e.g., F1) appears. In fact, in this case, a variation of F2 cannot produce the same effect as a variation of F1.

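Let us consider a circuit composed of K components F1,…, FK, and let aj, for j = 1,…, H, be the IOR coefficients. An incidence matrix Y can be written as follows:

$$\mathbf{Y} = \begin{array}{c|ccc} & a_1 & \cdots & a_H \\ \hline F_1 & y_{11} & \cdots & y_{1H} \\ \vdots & \vdots & & \vdots \\ F_K & y_{K1} & \cdots & y_{KH} \end{array}$$

where:

$$y_{ij} = \begin{cases} 1 & \text{if the component } F_i \text{ appears in the coefficient } a_j \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, \ldots, K; \; j = 1, \ldots, H.$$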

The incidence matrix inspection yields a set of groups that could constitute a CAG.

In general, a set of components can constitute a CAG only if, in every coefficient where one of them appears, at least two of them are present, i.e., if:

$$\sum_{i:\,F_i \in \mathrm{CAG}} y_{ij} \neq 1, \qquad j = 1, \ldots, H \tag{3}$$

It is worth noting that, if

$$\sum_{j=1}^{H} y_{ij} = 0$$

the component Fi is not diagnosable at all.
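The screening described by condition (3) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the toy incidence matrix (four components, three coefficients) are assumptions made for the example and do not come from the paper.

```python
from itertools import combinations


def potential_cags(incidence, order):
    """Return the groups of `order` components that pass condition (3).

    `incidence` maps a component name to its row of the matrix Y
    (one 0/1 flag per IOR coefficient). A group is kept only if no
    coefficient contains exactly one of its members.
    """
    names = list(incidence)
    n_coeff = len(next(iter(incidence.values())))
    kept = []
    for group in combinations(names, order):
        column_sums = [sum(incidence[c][j] for c in group) for j in range(n_coeff)]
        if all(s != 1 for s in column_sums):
            kept.append(group)
    return kept


def undiagnosable(incidence):
    """Components that appear in no coefficient (row sum equal to zero)."""
    return [name for name, row in incidence.items() if sum(row) == 0]


# Hypothetical toy incidence matrix, not taken from the paper.
Y = {
    "F1": [1, 1, 0],
    "F2": [1, 1, 0],
    "F3": [0, 1, 1],
    "F4": [0, 0, 0],
}

pairs = potential_cags(Y, 2)   # only F1 and F2 share all their coefficients
hidden = undiagnosable(Y)      # F4 never appears in the IOR
```

Groups that survive this test are only candidates: each of them still has to be checked by solving the corresponding algebraic system.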

For the potential CAGs, further analysis is necessary. Let us consider d components that could belong to a CAG of order d. They actually constitute a CAG iff a variation ∆Fi, i = 1,…, d, of these components does not modify the IOR (see Definition 1). This is equivalent to solving a problem of constrained identifiability, where the constraints consist of fixing the known parameters corresponding to the fault-free components. This problem can be expressed by a system of g equations for each CAG:

$$a_i(F_1, \ldots, F_k) = a_i(F'_1, \ldots, F'_k), \quad i = 1, \ldots, g, \qquad \text{subject to } F'_j = F_j \;\; \forall F_j \notin \mathrm{CAG} \tag{4}$$

where ai is the generic coefficient containing at least two components of the CAG, g is the number of such coefficients, and F’i = Fi + ∆Fi. If the d components constitute a CAG, the corresponding system, where the d components are the unknowns, is not identifiable. In particular, the d components constitute a CAG iff the algebraic system (4) has infinite solutions. If the system (4) admits a finite number of solutions, only a finite number of parameter vectors cannot be distinguished from each other. Such a situation is not considered ambiguous, because the probability of having exactly the ambiguous vector of parameter values is virtually null. Should such an ambiguity nevertheless occur between a faulty and a healthy configuration, the healthy configuration could be mistaken for the faulty one. In this work the system (4) has been solved by resorting to the theory of Gröbner bases [Becker and Weispfenning, 1993], which underlies many symbolic algorithms used to manipulate multivariate polynomials. In particular, the commercial tool REDUCE™ has been used, which implements Buchberger’s algorithm [Buchberger, 1985]. This algorithm generalizes both Gauss’ elimination method for multivariable linear equations and Euclid’s algorithm for polynomial equations in one variable.

4 Examples

In this section the procedure is first applied to a linear circuit and then to a nonlinear circuit, both being well-known benchmarks. In both cases, we assume the IOR of the circuit is available. The analysis we perform concerns the coefficients of this IOR, which we assume can be deduced from a set of observations. The method used to explicitly describe the relationship between the IOR coefficients and the observations is outside the scope of the present work, and it does not affect the generality of the approach described here.

In this perspective, the equations just described are the Analytical Redundancy Relations (ARR) [Cordier et al., 2004] of the diagnostic system. Furthermore, because the IOR coefficients are expressed in terms of the parameters of the circuit, it is possible to write the Signature Matrix for single faults [Cordier et al., 2004], which indicates whether a certain parameter affects the residual of a particular ARR. In fact, as the dependence of the IOR coefficients on the parameters is explicit, the Single Fault Signature Matrix coincides with the Incidence Matrix of the parameters vs the IOR coefficients.

4.1 Sallen-Key band-pass filter

The proposed procedure is first described by means of an example taken from the literature [Fedi et al., 1999][Cannas et al., 2004]: the Sallen-Key band-pass filter shown in Fig. 1. The IOR corresponding to the input/output voltages shown in Fig. 1 is:

$$a_1 y + a_2 \frac{dy}{dt} + \frac{d^2 y}{dt^2} = a_3 u \tag{5}$$

where:

$$a_1 = \frac{G_3 (G_1 + G_2)}{C_1 C_2}, \qquad a_2 = \frac{G_1 G_5 - G_2 G_4}{G_5 C_2} + \frac{G_3 (C_1 + C_2)}{C_1 C_2}, \qquad a_3 = \frac{G_1 (G_4 + G_5)}{G_5 C_2}$$

Figure 1- The Sallen-Key band-pass filter (components G1, G2, G3, G4, G5, C1, C2; input voltage vin, output voltage vout)


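In [Fedi et al., 1999] the following CAGs are reported for this circuit: {C1, G2, G3} and {G4, G5}. The incidence matrix Y is:

$$\mathbf{Y} = \begin{array}{c|ccc} & a_1 & a_2 & a_3 \\ \hline G_1 & 1 & 1 & 1 \\ G_2 & 1 & 1 & 0 \\ G_3 & 1 & 1 & 0 \\ G_4 & 0 & 1 & 1 \\ G_5 & 0 & 1 & 1 \\ C_1 & 1 & 1 & 0 \\ C_2 & 1 & 1 & 1 \end{array} \tag{6}$$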

Since the IOR has 3 coefficients, the analysis will be performed only for groups of order 2 and 3. In fact, all the groups of order greater than 3 contain at least one CAG.

We shall first evaluate the groups of order 2. By inspecting the matrix Y, it is possible to exclude a priori the ambiguity for a number of couples of components. For example, G1 and G2 cannot constitute a CAG because in the incidence matrix (see Eq. (6)):

$$y_{13} + y_{23} = 1$$

The same reasoning reduces the potential ambiguity groups to the 6 couples indicated by a ‘?’ in Table I. Each couple constitutes a CAG iff the corresponding system (4) has infinite solutions. For example, let us consider the couple (G1, C2). The system (4) consists of the following g = 3 equations:

$$\begin{cases}
\dfrac{G_1 (G_4 + G_5)}{G_5 C_2} = \dfrac{G'_1 (G_4 + G_5)}{G_5 C'_2} \\[2ex]
\dfrac{G_3 (G_1 + G_2)}{C_1 C_2} = \dfrac{G_3 (G'_1 + G_2)}{C_1 C'_2} \\[2ex]
\dfrac{G_1 G_5 - G_2 G_4}{G_5 C_2} + \dfrac{G_3 (C_1 + C_2)}{C_1 C_2} = \dfrac{G'_1 G_5 - G_2 G_4}{G_5 C'_2} + \dfrac{G_3 (C_1 + C'_2)}{C_1 C'_2}
\end{cases} \tag{7}$$

where G’1 and C’2 are the fault values and represent the system unknowns. The system (7) admits only one solution, G’1 = G1; C’2 = C2. Thus, G1 and C2 do not constitute a CAG.

Let us now consider the couple (G4, G5). The system (4) consists of the following g = 2 equations:

$$\begin{cases}
\dfrac{G_1 (G_4 + G_5)}{G_5 C_2} = \dfrac{G_1 (G'_4 + G'_5)}{G'_5 C_2} \\[2ex]
\dfrac{G_1 G_5 - G_2 G_4}{G_5 C_2} + \dfrac{G_3 (C_1 + C_2)}{C_1 C_2} = \dfrac{G_1 G'_5 - G_2 G'_4}{G'_5 C_2} + \dfrac{G_3 (C_1 + C_2)}{C_1 C_2}
\end{cases} \tag{8}$$

where G’4 and G’5 are the fault values and represent the system unknowns. The system (8) admits infinite solutions:

$$G'_4 = \frac{G_4\, G'_5}{G_5} \tag{9}$$

Thus, G4 and G5 constitute a CAG. By performing this analysis for all the 6 couples in Table I, it turns out that the couple (G4, G5) is the only CAG of order 2.
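The masking expressed by Eq. (9) can be checked numerically with exact rational arithmetic. The sketch below is not the Gröbner-basis computation used in the paper; it only evaluates the three IOR coefficients, using the expressions for a1, a2 and a3 given after Eq. (5) as read here. The nominal component values are arbitrary, hypothetical numbers chosen for the example.

```python
from fractions import Fraction as Fr


def ior_coefficients(G1, G2, G3, G4, G5, C1, C2):
    """IOR coefficients of the Sallen-Key filter (expressions after Eq. (5))."""
    a1 = G3 * (G1 + G2) / (C1 * C2)
    a2 = (G1 * G5 - G2 * G4) / (G5 * C2) + G3 * (C1 + C2) / (C1 * C2)
    a3 = G1 * (G4 + G5) / (G5 * C2)
    return (a1, a2, a3)


# Hypothetical nominal values (arbitrary units), exact rationals.
nominal = dict(G1=Fr(1), G2=Fr(2), G3=Fr(3), G4=Fr(4), G5=Fr(5),
               C1=Fr(1, 10), C2=Fr(1, 5))
reference = ior_coefficients(**nominal)


def same_ior(**changes):
    """True if the changed parameters give the same IOR coefficients."""
    return ior_coefficients(**{**nominal, **changes}) == reference


g5_faulty = Fr(7)
g4_balancing = nominal["G4"] * g5_faulty / nominal["G5"]   # Eq. (9)

masked = same_ior(G4=g4_balancing, G5=g5_faulty)   # fault pair hidden from the IOR
visible = same_ior(G5=g5_faulty)                   # G5 alone changes the IOR
```

With these values the balanced pair leaves all three coefficients unchanged, while a deviation of G5 alone does not, which is the behaviour expected of a CAG of order 2.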

Now let us consider the groups of order 3. By inspecting the matrix Y, it is possible to exclude a priori the ambiguity for a number of terns of components. For example, G1, G2, G3 cannot constitute a CAG because (see Eq. (6)):

$$y_{13} + y_{23} + y_{33} = 1$$

Let us consider the component G2. The potential CAGs of order 3 that contain the component G2 are shown in Table II. Let us now consider the tern (G2, G1, G4). The system (4) consists of the following g = 3 equations:

$$\begin{cases}
\dfrac{G_1 (G_4 + G_5)}{G_5 C_2} = \dfrac{G'_1 (G'_4 + G_5)}{G_5 C_2} \\[2ex]
\dfrac{G_3 (G_1 + G_2)}{C_1 C_2} = \dfrac{G_3 (G'_1 + G'_2)}{C_1 C_2} \\[2ex]
\dfrac{G_1 G_5 - G_2 G_4}{G_5 C_2} + \dfrac{G_3 (C_1 + C_2)}{C_1 C_2} = \dfrac{G'_1 G_5 - G'_2 G'_4}{G_5 C_2} + \dfrac{G_3 (C_1 + C_2)}{C_1 C_2}
\end{cases} \tag{10}$$

where G’2, G’1 and G’4 are the fault values and represent the system unknowns. The system (10) admits 2 solutions; thus, the group (G2, G1, G4) is not a CAG.

Let us now consider the tern (G2, G3, C1). The system (4) consists of the following g = 2 equations:

$$\begin{cases}
\dfrac{G_3 (G_1 + G_2)}{C_1 C_2} = \dfrac{G'_3 (G_1 + G'_2)}{C'_1 C_2} \\[2ex]
\dfrac{G_1 G_5 - G_2 G_4}{G_5 C_2} + \dfrac{G_3 (C_1 + C_2)}{C_1 C_2} = \dfrac{G_1 G_5 - G'_2 G_4}{G_5 C_2} + \dfrac{G'_3 (C'_1 + C_2)}{C'_1 C_2}
\end{cases} \tag{11}$$

where G’2, G’3 and C’1 are the fault values and represent the system unknowns. The system (11) admits infinite solutions:

$$G'_2 = \frac{G_2 G_4 - G_3 G_5 + G'_3 G_5}{G_4}; \qquad C'_1 = \frac{C_1\, G'_3}{G_3} \tag{12}$$

Thus, the group (G2, G3, C1) is a CAG. By performing this analysis for the other potential CAGs, it turns out that the group (G2, G3, C1) is the only CAG of order 3.

TABLE I. SALLEN-KEY FILTER- POTENTIAL CAGS OF ORDER 2

      G1   G2   G3   G4   G5   C1   C2
G1         -    -    -    -    -    ?
G2              ?    -    -    ?    -
G3                   -    ?    ?    -
G4                        ?    -    -
G5                             -    -
C1                                  -

TABLE II. SALLEN-KEY FILTER- POTENTIAL CAGS OF ORDER 3

         G1   G2   G3   G4   G5   C1   C2
G1 G2              -    ?    ?    -    ?
G1 G3                   ?    ?    -    ?
G1 G4                        -    ?    ?
G1 G5                             ?    ?
G1 C1                                  ?
G2 G3                   -    -    ?    -
G2 G4                        -    -    ?
G2 G5                             -    ?
G2 C1                                  -
G3 G4                        -    -    ?
G3 G5                             -    ?
G3 C1                                  -
G4 G5                             -    -
G4 C1                                  ?
G5 C1                                  ?

Solutions (9) and (12) provide important information: they express the relationship between the fault values that can be confused. This information can be combined with further constraints of the actual diagnostic problem in order to reduce the uncertainty of the diagnosis. For example, if an ambiguity exists only for a pair of values one of which is negative, in practice there is no ambiguity. This result represents an important improvement with respect to the previous approaches, where the solvability of the fault equation system is evaluated by means of the analysis of the Jacobian matrix [Catelani et al., 1987][Stenbakken et al., 1989][Carmassi et al., 1991][Liberatore et al., 1994][Fedi et al., 1998][Fedi et al., 1999][Starzyk et al., 2000][Cannas et al., 2004]. In that case, no information can be deduced on the relationship between the parameter values for indistinguishable faults.

4.2 Chua’s circuit

Chua's circuit (see Fig. 2) is a simple electronic circuit that exhibits classic chaos theory behaviour. First introduced in 1983 by Leon O. Chua [Chua, 1993], its ease of construction has made it a ubiquitous real-world example of a chaotic system, leading some to declare it 'a paradigm for chaos'. Chua's circuit consists of two linear capacitors, one linear resistor, one linear inductor and one nonlinear resistor. The nonlinear resistor (Chua's diode) is chosen to have a cubic V/I characteristic of the form [Huang et al., 1996]:

$$i = \gamma_1 v + \gamma_3 v^3$$

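The IOR obtained by considering the voltage v1 as the output is:

$$\frac{d^3 v_1}{dt^3} + \left(\frac{1}{R C_1} + \frac{1}{R C_2} + \frac{\gamma_1}{C_1}\right)\frac{d^2 v_1}{dt^2} + \left(\frac{1}{L C_2} + \frac{\gamma_1}{R C_1 C_2}\right)\frac{d v_1}{dt} + \frac{1 + R \gamma_1}{R L C_1 C_2}\, v_1 + \frac{3 \gamma_3}{C_1}\, v_1^2 \frac{d^2 v_1}{dt^2} + \frac{6 \gamma_3}{C_1}\, v_1 \left(\frac{d v_1}{dt}\right)^2 + \frac{3 \gamma_3}{R C_1 C_2}\, v_1^2 \frac{d v_1}{dt} + \frac{\gamma_3}{L C_1 C_2}\, v_1^3 = 0 \tag{13}$$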

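Let us call ai, with i = 1,…, 7, the coefficients of the equation (13). Then the incidence matrix Y will be:

$$\mathbf{Y} = \begin{array}{c|ccccccc} & a_1 & a_2 & a_3 & a_4 & a_5 & a_6 & a_7 \\ \hline R & 1 & 1 & 1 & 0 & 0 & 1 & 0 \\ C_1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ C_2 & 1 & 1 & 1 & 0 & 0 & 1 & 1 \\ L & 0 & 1 & 1 & 0 & 0 & 0 & 1 \\ \gamma_1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ \gamma_3 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{array} \tag{14}$$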

The coefficients are one more than the parameters, so all the possible groups of faults have to be considered as potential canonical ambiguity groups. By following the same procedure used in the linear example, simply on the basis of the incidence matrix (14), it is possible to exclude the presence of any CAG of order 2.

Now let us consider the potential CAGs of order 3. The analysis of the Y matrix yields the result reported in Table III. Only five potential CAGs of order 3 have to be examined: {R, C1, γ3}, {R, C2, L}, {R, C2, γ3}, {C1, C2, γ3}, {C1, γ1, γ3}. On the basis of the Gröbner analysis, none of these groups turns out to be an ambiguity group, as the corresponding systems of equations have only one solution.

By applying the analysis of the Y matrix, the following potential ambiguity groups of order greater than 3 are obtained: {R, C1, C2, γ3}, {R, C1, L, γ3}, {R, C1, γ1, γ3}, {R, C2, L, γ1}, {C1, C2, L, γ3}, {C1, C2, γ1, γ3}, {C1, L, γ1, γ3} of order 4; {R, C1, C2, L, γ3}, {R, C1, C2, γ1, γ3}, {C1, C2, L, γ1, γ3} of order 5; {R, C1, C2, L, γ1, γ3} of order 6. The Gröbner analysis has been applied to the corresponding 11 systems of equations, and for each of them we obtained only one solution. Thus, Chua’s circuit has no ambiguity groups; theoretically, any variation of one or more parameters of the circuit can therefore be uniquely determined.

5 Comments and conclusion

In this paper, a symbolic procedure for the determination of canonical ambiguity groups is proposed, based on the concept of a priori identifiability.

The method starts with the IOR determination; then an analysis of the occurrence of parameters in the input/output relation coefficients is performed in order to select the potential CAGs. In the last step, the corresponding parameter identifiability is investigated.

The IOR determination is performed by resorting to an implementation of Ritt’s algorithm. The main drawback of the algorithm is that it is directly applicable only to algebraic systems; some transcendental systems (e.g., systems with trigonometric or exponential nonlinear functions) can be analysed by resorting to suitable transformations that cancel the nonlinearity. Thus, the method cannot be used to determine the CAGs in circuits with nonlinear components described by piecewise linear functions, or by some transcendental functions.

Figure 2- Chua’s circuit (components R, L, C1, C2 and the nonlinear resistor Ng; voltages v1, v2; currents i3, ig)

TABLE III. CHUA’S CIRCUIT- POTENTIAL CAGS OF ORDER 3

         R    C1   C2   L    γ1   γ3
R  C1              -    -    -    ?
R  C2                   ?    -    ?
R  L                         -    -
R  γ1                             -
C1 C2                   -    -    ?
C1 L                         -    -
C1 γ1                             ?
C2 L                         -    -
C2 γ1                             -
L  γ1                             -

The identifiability problem has been solved by resorting to Buchberger’s algorithm for the calculation of the Gröbner bases. The limit of this algorithm is that it is only applicable to systems with rational algebraic coefficients.

An analysis of computational complexity has not been performed to date. It is worth noting that the complexity strongly depends on the chosen ordering of the variables, both for Ritt’s algorithm and for Buchberger’s algorithm. Moreover, a bound for the complexity is not available. The main advantage of the method is that it is independent of the fault value, since it does not introduce any linear approximation, as done in the majority of symbolic approaches proposed in the literature. Moreover, the analysis of the IOR coefficients instead of the transfer function coefficients allows one to extend to nonlinear systems the applicability of several methods presented in the literature only for linear systems.

Future work concerns the comparison with the methods presented in the literature in terms of complexity, and the extension of the method to a wider class of nonlinear circuits.

References

[Catelani et al., 1987] M. Catelani, G. Iuculano, A. Liberatore, S. Manetti, and M. Marini, “Improvements to numerical testability evaluation,” IEEE Trans. Instrum. Meas., vol. 36, pp. 902-907, Dec. 1987.

[Stenbakken et al., 1989] G. N. Stenbakken, T. M. Souders, and G. W. Stewart, “Ambiguity groups and testability”, IEEE Trans. Instrum. Meas., vol. 38, pp. 941-947, October 1989.

[Carmassi et al., 1991] R. Carmassi, M. Catelani, G. Iuculano, A. Liberatore, S. Manetti, and M. Marini, “Analog network testability measurement: a symbolic formulation approach,” IEEE Trans. Instrum. Meas., vol. 40, pp. 930-935, Dec. 1991.

[Liberatore et al., 1994] A. Liberatore, S. Manetti, and M.C. Piccirilli, “A new efficient method for analog circuit testability measurement,” IEEE Instrumentation and Measurement Technology Conference (IMTC), Hamamatsu, May 10-12, 1994, pp. 193-196.

[Fedi et al., 1998] G. Fedi, A. Luchetta, S. Manetti, and M. C. Piccirilli, “A new symbolic method for analog circuit testability evaluation,” IEEE Trans. Instrum. and Meas., vol. 47, pp. 554-565, April 1998.

[Fedi et al., 1999] G. Fedi, S. Manetti, M.C. Piccirilli, and J. Starzyk, “Determination of an optimum set of testable components in the fault diagnosis of analog linear circuits,” IEEE Trans. Circuits and Systems I: Fundamental Theory and Applications, vol. 46, pp. 779–787, July 1999.

[Starzyk et al., 2000] J. A. Starzyk, J. Pang, S. Manetti, M.C. Piccirilli, and G. Fedi, “Finding ambiguity groups in low testability analog circuits,” IEEE Trans. Circuits and Systems - I: Fundamental Theory and Applications, vol. 47, pp. 1125-1137, August 2000.

[Cannas et al., 2004] B. Cannas, A. Fanni, S. Manetti, A.Montisci, M.C. Piccirilli, “Neural Network Based Analog Fault Diagnosis Using Testability Analysis,” Neural Computing & Applications Journal, Springer, Neuro computing and Applications, pp. 258, vol.13, 2004.

[Cordier et al., 2004] M.O. Cordier, P. Dague, F. Lévy, J. Montmain, M. Staroswiecki, and L. Travé-Massuyès, “Conflicts Versus Analytical Redundancy Relations: A Comparative Analysis of the Model Based Diagnosis Approach From the Artificial Intelligence and Automatic Control Perspectives,” IEEE Trans. Sys., Man, Cyber. – Part B: Cybernetics, Vol. 34, No 5, October 2004.

[Sen and Saeks, 1979] N. Sen, and R. Saeks, “Fault diagnosis for linear systems via multifrequency measurements,” IEEE Trans. Circuits and Systems, vol. 26, pp. 457-465, July 1979.

[Chen and Saeks, 1979] H. Chen, and R. Saeks, “A search algorithm for the solution of the multifrequency fault diagnosis equations,” IEEE Trans. Circuits and Systems, vol. 26, pp. 589-594, July 1979.

[Becker and Weispfenning, 1993] T. Becker, and V. Weispfenning, Gröbner Bases: A Computational Approach to Commutative Algebra. New York: Springer-Verlag, 1993.

[Buchberger, 1985] B. Buchberger, Gröbner bases: An algorithmic method in polynomial ideal theory. Multidimensional Systems Theory, N. K. Bose, ed., D. Reidel Publishing Co., 1985, pp. 184-232.

[Fedi et al., 1997] G. Fedi, R. Giomi, A. Luchetta, S. Manetti, and M.C. Piccirilli, "Symbolic algorithm for ambiguity group determination in analog fault diagnosis." In Proc. of European Conference on Circuit Theory and Design (ECCTD'97), Budapest, Hungary, Aug. 1997, pp. 1286-1291.

[Manetti and Piccirilli, 2003] Manetti S., Piccirilli M., “A singular-value decomposition approach for ambiguity group determination in analog circuits,” IEEE Trans. on Circuits and Systems I-Fundamental theory and applications, vol. 50, n. 4, pp. 477-487, 2003.

[Ljung and Glad, 1994] L. Ljung and S.T. Glad, “On global identifiability for arbitrary model parameterization,” Automatica, vol. 30, pp. 265-276, 1994.

[Fliess and Glad, 1993] M. Fliess and S. T. Glad, “An algebraic approach to linear and nonlinear control,” Essays on Control: Perspectives in the Theory and its Applications, Groningen, vol. 14 of Progr. Systems Control Theory, Boston, Birkhäuser Boston, pp. 223-265, 1993.

[Ritt, 1950] J. F. Ritt, Differential Algebra, American Mathematical Society Colloquium Publications, vol. 33, 1950.

[Worsman and Wong, 2000] M. Worsman, M. W. T. Wong, “Non-linear analog circuit fault diagnosis with large change sensitivity,” International Journal on circuit theory and applications, vol. 28, pp. 281–303, 2000.

[Fedi et al., 1998b] G. Fedi, R. Giomi, S. Manetti, M.C. Piccirilli, “A Symbolic Approach for Testability Evaluation in Fault Diagnosis of Nonlinear Analog Circuits,” 1998 IEEE ISCAS, June 1998, Monterey, USA.

[Cannas et al., 2005] B. Cannas, A. Fanni, and A. Montisci, “Testability Evaluation for Analog Linear Circuits Via Transfer Function Analysis”, Proc. of IEEE International Symposium on Circuits and Systems (ISCAS), Kobe, Japan, May 23-26, 2005.

[Audoly et al., 2001] S. Audoly, G. Bellu, L. D’Angiò, M. P. Saccomani, and C. Cobelli, “Global identifiability of nonlinear models of biological systems,” IEEE Trans. Biomed. Eng., vol. 48, pp. 55–65, Jan. 2001.

[Huang et al., 1996] A. Huang, L. Pivka, C.W. Wu, M.Franz “Chua’s equation with cubic nonlinearity,” International Journal of Bifurcation and Chaos 1996; 6:2175 –2222.

[Chua, 1993] L. O. Chua, “Global unfolding of Chua’s circuit”, IEICE Trans. Fundamentals-I, pp. 704-734, 1993.


A Framework for Decentralized Qualitative Model-Based Diagnosis

Luca Console, Claudia Picardi
Università di Torino, Dipartimento di Informatica
Corso Svizzera 185, 10149, Torino, Italy
lconsole,[email protected]

Daniele Theseider Dupré
Università del Piemonte Orientale, Dipartimento di Informatica
Via Bellini 25/g, 15100, Alessandria
[email protected]

Abstract

In this paper we propose a framework for decentralized model-based diagnosis of complex systems. We consider the case where subsystems are developed independently along with their associated (or embedded) software modules – in particular their diagnostic software. This is useful in those situations where subsystems are developed (possibly by different suppliers) without a-priori knowledge of the system in which they will be exploited, or without making assumptions on the role they will play in such a system.

We describe a decentralized architecture where subsystems are analyzed by Local Diagnosers, coordinated by a Supervisor. Within the framework, both the Local Diagnosers and the Supervisor can be designed independently of each other, without any advance information on how the subsystems will be connected (provided that they share a common modeling ontology) and allowing also for run-time changes in the overall system structure. Local Diagnosers are thus loosely coupled and communicate with the Supervisor via a standard interface, supporting independent implementations.

1 Introduction

The application of model-based diagnosis often has to face the problem of architectural complexity in real physical systems. Decomposition has been recognized as an important lever to manage this type of complexity. Most of the approaches, however, focused on hierarchical decomposition [Genesereth, 1984; Mozetic, 1991], while decentralization has been explored less frequently (see e.g. [Pencole and Cordier, 2005]).

In this paper we propose a decentralized supervised approach to diagnosis. This approach generalizes the one in [Ardissono et al., 2005], which is tailored to the specific case of diagnosis of Web Services. We assume that a system is formed by subsystems, each one having its Local Diagnoser, while a Supervisor is associated with the overall system, with the aim of coordinating Local Diagnosers and producing global diagnoses.

Although the details of the approach will be discussed in the following sections, it is important to notice here that we consider a general case where:

• The Supervisor need not have a-priori knowledge about the subsystems and their paths of interaction.

• Each Local Diagnoser can adopt its own diagnostic strategy and must only implement a communication interface with the Supervisor.

• Local Diagnosers don’t know each other.

• The Supervisor should behave as a Local Diagnoser in case the system is a subsystem of a more complex system (hierarchical scalability).

Several reasons motivate the adoption of such an approach. First of all, it is suitable for distributed systems where the set of subsystems and their paths of interaction vary across time. The assumptions above guarantee that subsystems can be added/removed/replaced independently of each other at any time, that the paths of interaction may even be nondeterministic, and that no one has to be informed of the change (loosely coupled integration). Such a decoupling is important and interesting for several reasons. First, it allows developing Local Diagnosers independently of each other, making it possible to control the design of the diagnostics in a complex system. This is not only a problem of managing complexity. In many application cases the subsystems are designed by different entities (e.g., suppliers of the company assembling the complex system) and thus are black boxes for the designer of the complex system. These problems are common in many application domains. For example, in aerospace applications, it is very common that systems are assembled with subsystems provided by different suppliers. A simple example is a landing gear, including hydraulic, mechatronic and electronic components that are assembled by the assembler of the aircraft. Each subsystem (e.g., power transmission, or a hydraulic actuator) is supplied together with its own FMECA and control/diagnostic software, without any information on its internal details.

Indeed, in our approach we assume that Local Diagnosers are designed independently of each other, and that the design of a Supervisor requires only that Local Diagnosers implement a common communication interface (which is a reasonable assumption in an assembler-supplier relationship). In this sense a decentralized approach where Local Diagnosers communicate with a Supervisor is more flexible than a fully distributed approach, as we can assume that Local Diagnosers have no information or model about the others and we make no assumption on the diagnosers and the models they use (provided the models share a common ontology).

The paper is organized as follows: in the next section we describe the decentralized architecture we propose. Section 3 explains the assumptions we make on the models used for diagnosis. Then, section 4 discusses how decentralized diagnosis is carried out in our approach. Section 5 concludes with a discussion on related work.

2 Architecture

The architecture we define is a supervised one where:

• A Local Diagnoser is associated with each subsystem. The model of the subsystem is private and it is not visible outside the diagnoser.

• A Supervisor of the diagnostic process is associated with the system. It coordinates the work of the diagnosers, optimizing their invocations during a diagnostic session.

The structure of the system (i.e., the subsystems and their connections) may be known a-priori to the Supervisor or may be reconstructed by it during the interaction (without any a-priori information). This increases the flexibility of the approach as it allows us to deal with systems whose structure (number of components and their connections) may vary dynamically. The architecture defines a communication interface between the Local Diagnosers and the Supervisor. Local Diagnosers are awakened by the subsystem they are in charge of whenever an anomaly (fault) is detected. They may explain the fault locally or may put the blame on an input received from another subsystem. They communicate to the Supervisor their explanations (without disclosing the details of the local failure(s)). The Supervisor may invoke the Local Diagnoser of the blamed subsystem asking it to generate explanations. The Supervisor may also ask a Local Diagnoser to check some predictions. The selection of the Local Diagnosers to invoke is made according to rules that optimize invocations, as we will discuss in section 4.2.
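The interaction pattern just described can be sketched as follows. This is a deliberately simplified, hypothetical illustration: the class names, the `explain` method, the `Explanation` record and the variable names "dV" and "dY" are assumptions made for the example, and the sketch ignores the prediction checks and the invocation-optimizing rules mentioned above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Explanation:
    """One hypothesis returned by a Local Diagnoser (hypothetical format)."""
    blamed_inputs: tuple = ()   # inputs blamed on other subsystems; empty = local fault


class LocalDiagnoser(ABC):
    """Interface seen by the Supervisor; the subsystem model stays private."""

    @abstractmethod
    def explain(self, anomaly: str) -> list:
        """Return the Explanations for an anomalous quantity of this subsystem."""


class Supervisor:
    def __init__(self, producer_of: dict):
        # Maps an exchanged variable to the diagnoser of the subsystem producing it.
        self.producer_of = producer_of

    def diagnose(self, diagnoser: LocalDiagnoser, anomaly: str) -> list:
        """Collect the names of the subsystems that may contain the fault."""
        suspects, pending, seen = [], [(diagnoser, anomaly)], set()
        while pending:
            current, symptom = pending.pop()
            if (id(current), symptom) in seen:
                continue
            seen.add((id(current), symptom))
            for explanation in current.explain(symptom):
                if not explanation.blamed_inputs:
                    suspects.append(type(current).__name__)
                for variable in explanation.blamed_inputs:
                    # Follow the blame to the subsystem producing that input.
                    pending.append((self.producer_of[variable], variable))
        return suspects


class PowerSupply(LocalDiagnoser):
    def explain(self, anomaly):
        return [Explanation()]   # only a local fault can explain the anomaly


class HydraulicExtraction(LocalDiagnoser):
    def explain(self, anomaly):
        # Either a local fault, or the deviation of the input "dV" is to blame.
        return [Explanation(), Explanation(blamed_inputs=("dV",))]


supply, hes = PowerSupply(), HydraulicExtraction()
supervisor = Supervisor({"dV": supply})
suspects = supervisor.diagnose(hes, "dY")
```

Note that each call to `explain` depends only on its argument, so the diagnosers in the sketch keep no state between invocations, and the Supervisor never inspects their models.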

The architecture is scalable (hierarchical) in the sense that the Supervisor can act as a Local Diagnoser when the system is used as a subsystem in a more complex system.

Notice that, in order to communicate, the Supervisor and Local Diagnosers must speak a common language; in other words, we must assume that they share a modeling ontology that allows them to share information about exchanged quantities. This nevertheless allows each Local Diagnoser to use its own modeling language and assumptions, and even to internally use a different ontology if needed, provided that it is able to map it over the common one during communications.

3 Models

The approach we propose focuses on Qualitative Models; in this paper we will in particular deal with deviation models [Malik and Struss, 1996], although we will briefly describe how our proposal can be applied to other qualitative models as well. It is worth noting that deviation models proved to be successful in several applications to technical domains (e.g., [Sachenbacher et al., 2000; Picardi et al., 2002; 2004]).

Figure 1: The Landing Gear section of an Aircraft System (subsystems shown: Engine, Cockpit, Hydraulic Extraction Subsystem, LEDs, Landing Gear, DC Power, Power Transmission, Independent Power Supply; exchanged quantities: ∆E, ∆X, ∆V, ∆Y, ∆C, ∆T)

As we discussed in the previous section, the Supervisor is not aware of specific subsystem models; however, it is aware of their existence and, in order to coordinate Local Diagnosers and combine the information they provide, it must make some assumptions on the modeling ontology. In this section we will discuss these assumptions.

First of all, as is common in a large part of model-based diagnosis, we consider component-oriented models, where each component has one correct behavior and one or more faulty behaviors. Each behavior of a component is modeled as a relation over a set of variables. Variables that represent quantities exchanged with other components are either input or output variables. The complete system is specified as a set of assignments of an output variable of one component to an input variable of another one.

By Qualitative Model (of a physical system) we mean a model where all variables have a finite, discrete domain, thus representing either a discretization of the corresponding physical quantity, or an abstraction over some of its properties. This implies that, in a qualitative model, the relation that defines a component or system can be expressed extensionally (although in most cases this is not desirable). In this context, each component C can be represented as a unique relation over a set of variables by introducing a distinguished mode variable C.m, that ranges over the set of behavior modes {ok, ab1, . . . , abn} of the component itself.

As mentioned above, in this paper we will particularly fo-cus on deviation models, which means that:

• Each physical quantity q represented in the model hasa corresponding variable ∆q representing a qualitativeabstraction of the deviation of q with respect to its ex-pected value. Common domains for deviation variablesare ok, ab, expressing whether the value of q is nor-mal or not, and −, 0, +, expressing whether q is lowerthan, equal to, or higher than its expected value. Here wewill take this last option; we will therefore be discussingsign-based deviation models.

• Each behavior model expresses relations among deviation variables. For example, an electrical model might state that if the resistance in a circuit is higher than expected, then the intensity of the current flowing in the circuit is lower than expected.
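The electrical example can be made concrete with a small sketch. It assumes Ohm's law I = V/R with positive V and R, so that the sign of the current deviation behaves like ∆V minus ∆R in sign algebra. The function names and the encoding of signs as the strings "-", "0" and "+" are illustrative choices, not part of the models discussed in this paper.

```python
# Sign-based deviation relation for a resistor, assuming I = V / R
# with positive V and R (an assumption made only for this illustration).
NEGATE = {"-": "+", "0": "0", "+": "-"}
ALL_SIGNS = {"-", "0", "+"}


def sign_add(a, b):
    """Qualitative sum of two signs; opposite signs give an ambiguous result."""
    if a == "0":
        return {b}
    if b == "0" or a == b:
        return {a}
    return set(ALL_SIGNS)


def current_deviation(d_voltage, d_resistance):
    """Possible signs of the current deviation given voltage and resistance deviations."""
    return sign_add(d_voltage, NEGATE[d_resistance])


# Resistance higher than expected, voltage normal: current lower than expected.
print(current_deviation("0", "+"))
```

When both the voltage and the resistance are higher than expected, the relation returns all three signs: the qualitative model cannot tell which effect prevails.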

We will now introduce a running example that we will use throughout the paper to illustrate the approach.

Figure 1 shows a simplified part of an Aircraft System, namely the part concerned with the Landing Gear. The picture represents a high-level view of the system: we do not see individual components, but interacting subsystems. Subsystems pictured with a dashed line do not directly take part in the example, but are shown for the sake of completeness.

The Hydraulic Extraction System (HES) creates a pressure that mechanically pushes the Landing Gear, thereby extracting it. The HES is also connected to some LEDs in the cockpit, which show whether the subsystem is powered up or not. In order to create the pressure, the HES takes power from two sources. The main source is the Power Transmission from the aircraft engine, which actually powers up the main pump of the HES. A secondary source (used to transmit the pilot command from the cockpit, and to light up the LEDs) is the Independent Power Supply, which produces a low-amperage DC current.

We will detail parts of this example in the following, as needed to explain the different aspects of the decentralized diagnostic process.

4 Decentralized Diagnostic Process

In this section we will describe how the diagnostic process is carried out. As we have already mentioned, a Supervisor coordinates the activities of several Local Diagnosers LD1, . . . , LDn. In order to obtain a loosely coupled integration, the Supervisor assumes that each LDi is stateless, that is, every invocation of a Local Diagnoser is independent of the others. Moreover, the Supervisor does not make any assumption on the implementation of each Local Diagnoser. The interaction is thus defined by:

• An interface that each Local Diagnoser must implement, discussed in section 4.1. The role of Local Diagnosers consists in generating hypotheses consistent with their model and observations.

• An algorithm for the Supervisor that computes diagnoses by coordinating the Local Diagnosers through the above interface, explained in section 4.2. The role of the Supervisor is to propagate hypotheses to the neighbors of a Local Diagnoser.

Figure 2 shows an example of this architecture for the system in Figure 1.


[Figure 2: An example of decentralized architecture. The Supervisor invokes EXTEND on the Local Diagnosers LD1 to LD5, which are attached to the subsystems (among them LEDs, Landing Gear, DC Power and Power Transmission).]

[Figure 3: A closer view of the Hydraulic Extraction System. Components: Pump, Pipe, Valve, Pressure Chamber, Wire. Variables: ∆V (main power), ∆C (DC power, command), ∆X (DC power, input), ∆Y (DC power, output), ∆E (extraction).]

4.1 The Extend interface

We assume that each Local Diagnoser LDi reasons on a model Mi of its subsystem Si, according to what we discussed in section 3. From the point of view of the Supervisor and its interactions with Local Diagnosers, each model Mi is a relation over a set of variables where:

• mode variables, denoted by Si.m, express the behavior mode of components in Si; each of them has thus a different domain;

• input/output variables express deviations of quantities that Si exchanges with other subsystems.

Of course a model Mi may include additional (internal and private) variables, and may not be expressed as an extensional relation at all. This is however hidden in the implementation of each Local Diagnoser.

The interface between the Supervisor and the Local Diagnosers is made of a single operation called EXTEND that the Supervisor invokes on the Local Diagnosers. Moreover, upon an abnormal observation, Local Diagnosers autonomously execute EXTEND and send the results to the Supervisor.

The goal of EXTEND is to explain and/or verify the consistency of observations and/or hypotheses made by other Local Diagnosers. Thus, the input to EXTEND when invoked on LDi is a set of hypotheses on the values of input/output variables of Mi. Such hypotheses are represented as a set of partial¹ assignments over the variables of interest.

In the particular case where EXTEND is autonomously executed by a Local Diagnoser upon receiving an abnormal observation from its subsystem, the set of hypotheses contains only the empty assignment (i.e. an assignment with an empty domain).

For each hypothesis α received in input, LDi must first of all check whether α is consistent with the local diagnostic model Mi and the local observations ωi. Then LDi must extend α in two directions: backwards, to verify whether that hypothesis needs further assumptions to be supported within Mi and ωi, and forward, to see whether the hypothesis has consequences on other subsystems that may be verified in order to discard or confirm it.

Let us consider as an example the Hydraulic Extraction System, a (simplified) close-up view of which is depicted in Figure 3.

Suppose the Local Diagnoser LD3, associated with the HES, receives in input an assignment that assigns the value "−" to its output data variable HES.∆E, representing the mechanical extraction of the Landing Gear. This value means that the Landing Gear did not extract when expected.

¹ A partial assignment over a set of variables V is any assignment α whose domain Dom(α) is a subset of V.

LD3 may see that, with respect to the local model, for that piece of data to be less than expected, one of the following must have happened:

• An internal component has failed, for example the pump is stuck or the pipe is leaking.

• One of the two inputs of the pump is wrong, for example the pump is not powered or it has not received the command.

This means that the partial assignment HES.∆E = − can be extended in four ways: by assigning HES.m = stuck, or HES.m = leaking, or HES.∆V = −, or HES.∆C = −.

What is important regarding extensions is that they should include everything that can be said under a given hypothesis, but nothing that could be said also without it. In other words, we are interested in knowing whether a given assignment constrains other variables more than the model alone does. The following definition captures and formalizes this notion.

Def. 1² Let Mi be a local model with local observations ωi, and let γ be an assignment. We say that γ is admissible with respect to Mi and ωi if:

i. γ is an extension of ωi;

ii. γ is consistent with Mi;

iii. if Mi,γ is the model obtained by constraining Mi with γ, its restriction to variables not assigned in γ is equivalent to the restriction of Mi itself to the same variables:

$M_{i,\gamma}|_{VAR(M_i) \setminus Dom(\gamma)} \equiv M_i|_{VAR(M_i) \setminus Dom(\gamma)}$.

Thus, for each assignment α received in input, the EXTEND operation of a Local Diagnoser LDi returns a set of assignments Eα that contains all minimal admissible extensions of α with respect to Mi and ωi, restricted to input, output and mode variables of Mi. Notice that, whenever α is inconsistent with the observations and/or the model, Eα is empty.

Minimal admissibility avoids unnecessary commitments on values of variables that are not concretely constrained by the current hypothesis.

In order to illustrate these definitions, let us consider a (again, simplified) model MP of the pump P alone. The model includes four variables: P.m represents the behaviour mode, P.∆V the power supply to the pump, P.∆C the command that turns on the pump, and P.∆F the flow coming out from the pump. The extensional representation of MP is:

P.m   | P.∆C | P.∆V | P.∆F
ok    | 0    | 0    | 0
ok    | 0    | +    | +
ok    | 0    | −    | −
ok    | +    | 0    | +
ok    | +    | +    | +
ok    | +    | −    | 0, +, −
ok    | −    | 0    | −
ok    | −    | +    | 0, +, −
ok    | −    | −    | −
stuck | 0    | 0    | −
stuck | 0    | +    | 0, +, −
stuck | 0    | −    | −
stuck | +    | 0    | 0, +, −
stuck | +    | +    | 0, +, −
stuck | +    | −    | 0, +, −
stuck | −    | 0    | −
stuck | −    | +    | 0, +, −
stuck | −    | −    | −

² The definitions in this section are rephrased from [Ardissono et al., 2005].

Now suppose we execute EXTEND on MP alone, having in input an assignment α such that Dom(α) = {P.∆F} and α(P.∆F) = −. Let us first of all verify that α itself is not admissible, by computing the restrictions of MP and of MP,α to the variables P.m, P.∆C and P.∆V.

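MP restricted to {P.m, P.∆C, P.∆V}:

P.m   | P.∆C | P.∆V
ok    | 0    | 0
ok    | 0    | +
ok    | 0    | −
ok    | +    | 0
ok    | +    | +
ok    | +    | −
ok    | −    | 0
ok    | −    | +
ok    | −    | −
stuck | 0    | 0
stuck | 0    | +
stuck | 0    | −
stuck | +    | 0
stuck | +    | +
stuck | +    | −
stuck | −    | 0
stuck | −    | +
stuck | −    | −

≢

MP,α restricted to {P.m, P.∆C, P.∆V}:

P.m   | P.∆C | P.∆V
ok    | 0    | −
ok    | +    | −
ok    | −    | 0
ok    | −    | +
ok    | −    | −
stuck | 0    | 0
stuck | 0    | +
stuck | 0    | −
stuck | +    | 0
stuck | +    | +
stuck | +    | −
stuck | −    | 0
stuck | −    | +
stuck | −    | −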

It is easy to see that the minimal admissible extensions of α with respect to MP are the following:

   | P.m   | P.∆C | P.∆V | P.∆F
γ1 | stuck |      |      | −
γ2 |       |      | −    | −
γ3 |       | −    |      | −

γ1, γ2 and γ3 express the three possible explanations for ∆F = −: either the pump is stuck, or it is not powered, or it did not receive the command. For each of the three assignments, unassigned variables are those whose value is irrelevant to the explanation.
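The pump example can be replayed with a brute-force sketch of EXTEND over an extensional model. This is only an illustration under stated assumptions: the model is stored as a set of tuples, deviations ∆X are written dX, the entry "*" abbreviates "0, +, −", and the helper names (constrain, restrict, admissible, extend) are hypothetical. An actual Local Diagnoser is free to implement EXTEND in any other way.

```python
from itertools import combinations, product

VARS = ("P.m", "P.dC", "P.dV", "P.dF")  # dX stands for the deviation ∆X
SIGNS = ("-", "0", "+")

# Value of P.dF for each (P.dC, P.dV); "*" means any sign is allowed.
OK = {("0", "0"): "0", ("0", "+"): "+", ("0", "-"): "-",
      ("+", "0"): "+", ("+", "+"): "+", ("+", "-"): "*",
      ("-", "0"): "-", ("-", "+"): "*", ("-", "-"): "-"}
STUCK = {("0", "0"): "-", ("0", "+"): "*", ("0", "-"): "-",
         ("+", "0"): "*", ("+", "+"): "*", ("+", "-"): "*",
         ("-", "0"): "-", ("-", "+"): "*", ("-", "-"): "-"}


def build_model():
    """Expand the two tables into the set of tuples of the relation M_P."""
    rows = set()
    for mode, table in (("ok", OK), ("stuck", STUCK)):
        for (dc, dv), df in table.items():
            for value in (SIGNS if df == "*" else (df,)):
                rows.add((mode, dc, dv, value))
    return rows


MODEL = build_model()


def constrain(model, gamma):
    """Keep only the tuples that agree with the assignment gamma."""
    return {row for row in model
            if all(row[VARS.index(v)] == val for v, val in gamma.items())}


def restrict(model, keep):
    """Project the relation onto the variables listed in keep."""
    idx = [VARS.index(v) for v in keep]
    return {tuple(row[i] for i in idx) for row in model}


def admissible(model, gamma, omega=None):
    """Check conditions (i)-(iii) of Def. 1 for gamma."""
    omega = omega or {}
    if any(gamma.get(v) != val for v, val in omega.items()):
        return False                      # (i) gamma must extend omega
    constrained = constrain(model, gamma)
    if not constrained:
        return False                      # (ii) gamma must be consistent
    free = [v for v in VARS if v not in gamma]
    return restrict(constrained, free) == restrict(model, free)  # (iii)


def extend(model, alpha, omega=None):
    """Enumerate minimal admissible extensions of alpha, smallest first."""
    domains = {v: sorted({row[i] for row in model}) for i, v in enumerate(VARS)}
    free = [v for v in VARS if v not in alpha]
    found = []
    for k in range(len(free) + 1):
        for extra in combinations(free, k):
            for values in product(*(domains[v] for v in extra)):
                gamma = {**alpha, **dict(zip(extra, values))}
                if any(g.items() <= gamma.items() for g in found):
                    continue              # a smaller extension already covers it
                if admissible(model, gamma, omega):
                    found.append(gamma)
    return found


for gamma in extend(MODEL, {"P.dF": "-"}):
    print(gamma)
```

On this model the enumeration yields exactly the three assignments listed above. The exhaustive search is exponential in the number of variables, so it is meant only for small component models of this kind.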

4.2 The Supervisor algorithm

The Supervisor is started by a Local Diagnoser LDfirst that sends to it the results of an EXTEND operation executed as a consequence of an abnormal observation.

During its computation, the Supervisor records the following information:

• a set H of assignments, representing current hypotheses;

• for each assignment α and for each variable x ∈ Dom(α), a modified bit. By mdf(α(x)) we will denote the value of this bit for variable x in assignment α.

Moreover, we assume that given an input or output variable x that has been mentioned by one of the Local Diagnosers, the Supervisor is able to determine the variable conn(x) connected to it. We do not make any assumption on whether this information is known a priori by the Supervisor, or it is retrieved dynamically, or it is provided by the Local Diagnoser itself.

The Supervisor initializes its data structures with the results of the initial EXTEND operation that has awakened it:

• H contains all the assignments that were sent by LDfirst as the result of EXTEND;

• for each α ∈ H, and for each x ∈ Dom(α), the Supervisor extends α to the variable conn(x) connected with x by assigning α(conn(x)) = α(x). Then the Supervisor sets mdf(α(conn(x))) = 1.


Modified bits are used by the Supervisor to understand whether it should invoke a Local Diagnoser or not. The basic rule it uses is the following:

Basic Rule. If a subsystem Si has a variable x with mdf(α(x)) = 1 for some α, then LDi is a candidate for invocation and α should be passed on to EXTEND.

There are however two exceptions to this rule, which reduce the number of invocations:

Vertical Exception. Given a variable x belonging to a subsystem Si, if all assignments give the same value to x, and either the value is 0 or x is an input variable, then x and its modified bits should not count towards deciding whether LDi should be invoked or not. This exception prevents a Local Diagnoser from being invoked to verify or discard something that has already been assessed to be true. If however the assessed truth concerns an abnormal value for one of Si's outputs, then LDi is still asked to give an explanation. This exception is referred to as vertical because it considers the value of a variable throughout different assignments, that is, a column in the assignments table representing H.

Horizontal Exception. Given an assignment α and a subsystem Si, if mdf(α(x)) = 1 implies α(x) = 0, then α and its modified bits should not count towards deciding whether LDi should be invoked or not. If LDi is nevertheless invoked, α should be passed on to EXTEND only if LDi has never been invoked before. This exception prevents a Local Diagnoser from being invoked only to verify that "everything ok" is consistent with its model (which is trivially true). It is called horizontal because it considers values of different variables in the same assignment, that is, a row in the assignments table representing H.

Notice that when deciding whether the Horizontal Exception can be applied, only variables with modified bits set to 1 are considered. In order to apply it, we do not need the whole assignment to assign value 0, but only the part of it that has never been examined by the proper Local Diagnoser. Thus, if a Local Diagnoser LDi has already been invoked, it is never invoked again until one of the variables in Si is assigned a value different from 0. Thanks to the definition of EXTEND, we do not have to worry that a newly assigned 0 is inconsistent with a previously extended assignment.
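The Basic Rule and its two exceptions can be sketched as a selection function. This is one possible reading, with assumptions stated explicitly: assignments are dictionaries, modified bits are kept as the set of variables whose bit is 1, the owning subsystem is taken from the prefix of the variable name, and "all assignments give the same value to x" is read as requiring x to be assigned in every assignment of H. The data reproduce the hypotheses obtained after the DC Power and Power Transmission diagnosers have answered in the example discussed later in this section; the identifiers are hypothetical.

```python
def owner(var):
    """Subsystem a variable belongs to, e.g. 'HES.dX' -> 'HES' (naming assumption)."""
    return var.split(".", 1)[0]


def vertical(var, hypotheses, input_vars):
    """Vertical Exception: same value in every assignment, and value 0 or input variable."""
    values = [alpha.get(var) for alpha in hypotheses.values()]
    if None in values or len(set(values)) != 1:
        return False
    return values[0] == "0" or var in input_vars


def candidates(hypotheses, mdf, input_vars):
    """Subsystems whose Local Diagnoser meets the Basic Rule with no Exception."""
    result = set()
    for key, alpha in hypotheses.items():
        for sub in {owner(x) for x in mdf[key]}:
            modified = [x for x in mdf[key] if owner(x) == sub]
            # Horizontal Exception: every modified variable of the subsystem is 0.
            horizontal = all(alpha[x] == "0" for x in modified)
            counted = [x for x in modified
                       if not vertical(x, hypotheses, input_vars)]
            if counted and not horizontal:
                result.add(sub)
    return result


INPUTS = {"LG.dE", "HES.dV", "HES.dC", "HES.dX", "LED.dY", "PT.dT"}
H3 = {
    "a01": {"LG.dE": "-", "HES.m": "leak", "HES.dE": "-",
            "PT.dT": "0", "Eng.dT": "0"},
    "a02": {"LG.dE": "-", "HES.m": "stuck", "HES.dE": "-",
            "PT.dT": "0", "Eng.dT": "0"},
    "a03": {"LG.dE": "-", "HES.dE": "-", "HES.dV": "-", "PT.m": "fail",
            "PT.dV": "-", "PT.dT": "0", "Eng.dT": "0"},
    "a04": {"LG.dE": "-", "HES.dE": "-", "HES.dC": "-", "HES.dX": "-",
            "DCP.m": "fail", "DCP.dC": "-", "DCP.dX": "-",
            "PT.dT": "0", "Eng.dT": "0"},
}
MDF3 = {"a01": {"Eng.dT"}, "a02": {"Eng.dT"}, "a03": {"Eng.dT"},
        "a04": {"Eng.dT", "HES.dX"}}

print(candidates(H3, MDF3, INPUTS))
```

In this situation the engine falls under both exceptions (its only modified variable is 0 in every assignment), while the HES is selected because of the abnormal value newly assigned to HES.∆X.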

After initializing data structures, the Supervisor loops over the following four steps:

Step 1: select the next LDi to invoke. The Supervisor selects one of the Local Diagnosers LDi for which there is at least one assignment α meeting the Basic Rule with no Exception. If there is none, the loop terminates.

Step 2: invoke EXTEND. If LDi has never been invoked before in this diagnostic process, then the input to EXTEND is the set of all assignments in H, restricted to variables of Mi. Otherwise the input consists only of those assignments α that meet the Basic Rule but not the Horizontal Exception.

Step 3: update H. The Supervisor receives the output of EXTEND from LDi. For each α in input, EXTEND has returned a (possibly empty) set of extensions Eα = {γ1, . . . , γk}. Then α is replaced in H by the set of assignments {β1, . . . , βk} where βj is obtained by:

• combining α with γj ∈ Eα;

• extending the result of this combination to connected variables, so that for each x ∈ Dom(γj) representing an input/output variable, βj(conn(x)) = βj(x).

This implies that rejected assignments, having no extensions, are removed from H.

Step 4: update the mdf bits. For each assignment βj added in Step 3, mdf bits are set as follows:

(i) For each variable x not belonging to Mi such that x ∈ Dom(βj) and x ∉ Dom(α), mdf(βj(x)) is set to 1.

(ii) For each variable x belonging to Mi such that x ∈ Dom(βj), mdf(βj(x)) is set to 0.

(iii) For any other variable x ∈ Dom(βj), mdf(βj(x)) = mdf(α(x)).³

Notice that the diagnostic process terminates: new requests for EXTEND are generated only if assignments are properly extended, but assignments cannot be extended indefinitely.
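Steps 3 and 4 can be sketched as a single update function. The sketch reads case (i) as applying to variables that are new with respect to α, which is what footnote 3 suggests; the dictionary encoding, the set-of-modified-variables representation and the identifiers are assumptions made for illustration only.

```python
def update(alpha, mdf_alpha, extensions, model_vars, conn):
    """Replace alpha by one (beta, mdf_beta) pair per extension returned by EXTEND.

    alpha       assignment in H (dict: variable -> value)
    mdf_alpha   variables of alpha whose modified bit is 1
    extensions  the set E_alpha returned by the Local Diagnoser (list of dicts)
    model_vars  variables belonging to the invoked model M_i
    conn        mapping from an input/output variable to the one connected to it
    """
    result = []
    for gamma in extensions:
        beta = {**alpha, **gamma}          # Step 3: combine alpha with gamma_j
        for x in gamma:
            if x in conn:                  # Step 3: copy values across connections
                beta[conn[x]] = beta[x]
        mdf_beta = set()
        for x in beta:
            if x in model_vars:
                continue                   # (ii) bits of M_i variables become 0
            if x not in alpha or x in mdf_alpha:
                mdf_beta.add(x)            # (i) new variable, or (iii) inherited 1
        result.append((beta, mdf_beta))
    return result                          # empty list: alpha is rejected


# First invocation of the HES diagnoser in the running example.
CONN = {"HES.dE": "LG.dE", "HES.dV": "PT.dV", "HES.dC": "DCP.dC"}
HES_VARS = {"HES.m", "HES.dE", "HES.dV", "HES.dC", "HES.dX", "HES.dY"}
ALPHA0 = {"LG.dE": "-", "HES.dE": "-"}
E_ALPHA0 = [
    {"HES.dE": "-", "HES.m": "leak"},
    {"HES.dE": "-", "HES.m": "stuck"},
    {"HES.dE": "-", "HES.dV": "-"},
    {"HES.dE": "-", "HES.dC": "-"},
]

for beta, bits in update(ALPHA0, {"HES.dE"}, E_ALPHA0, HES_VARS, CONN):
    print(beta, sorted(bits))
```

Only the two hypotheses that blame an input of the HES end up with a modified bit set, on PT.∆V and DCP.∆C respectively, which is why the corresponding Local Diagnosers become candidates for invocation.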

At the end of the loop, assignments in H provide consistency-based diagnoses as follows:

Def. 2 Let α be an assignment in H at the end of the diagnostic process. The diagnosis associated with α is the assignment ∆(α) such that:

• Dom(∆(α)) is the set of all mode variables of all involved models;

• for each x ∈ Dom(α)∩Dom(∆(α)), ∆(α)(x) = α(x);

• for each x ∈ Dom(∆(α)) \ Dom(α), ∆(α)(x) = ok.

Of these diagnoses, non-minimal ones are discarded.

It can be proved that this algorithm computes all minimal consistency-based diagnoses with respect to the model and observations of the part of the system involved in the diagnostic process.
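Def. 2 and the final filtering can be sketched as follows. The notion of minimality is an assumption here: a diagnosis is discarded when another one assigns a strict subset of its (mode variable, fault mode) pairs. Other orderings, for instance on the set of faulty components only, are equally plausible readings. All identifiers are hypothetical.

```python
def diagnosis(alpha, mode_vars):
    """Def. 2: keep the mode variables of alpha, default every other mode to ok."""
    return {m: alpha.get(m, "ok") for m in mode_vars}


def faults(diag):
    """The (mode variable, fault mode) pairs that differ from ok."""
    return {(m, v) for m, v in diag.items() if v != "ok"}


def minimal_diagnoses(hypotheses, mode_vars):
    """Diagnoses of all assignments, without duplicates and non-minimal ones."""
    unique = []
    for alpha in hypotheses:
        d = diagnosis(alpha, mode_vars)
        if d not in unique:
            unique.append(d)
    return [d for d in unique
            if not any(faults(other) < faults(d) for other in unique)]


MODES = ["LG.m", "HES.m", "LED.m", "DCP.m", "PT.m"]
FINAL_H = [
    {"LG.dE": "-", "HES.m": "leak", "HES.dE": "-"},
    {"LG.dE": "-", "HES.m": "stuck", "HES.dE": "-"},
    {"LG.dE": "-", "HES.dE": "-", "HES.dV": "-", "PT.m": "fail", "PT.dV": "-"},
]

for d in minimal_diagnoses(FINAL_H, MODES):
    print(sorted(faults(d)))
```

Applied to the three hypotheses that survive in the example of this section, the sketch returns three single-fault diagnoses, none of which subsumes another.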

Notice that in the Supervisor algorithm the role of mode variables is rather peculiar. Since these variables are local to a given system, the Supervisor never communicates their value to a Local Diagnoser different from the one originating them. Thus, they are not needed for cross-consistency checks. There are two main reasons why we need the Supervisor to be aware of them:

• They provide the connection between two different invocations on the same Local Diagnoser; by having the Supervisor record them and send them back on subsequent calls, we allow the Local Diagnosers to be stateless.

• The Supervisor needs them to compute globally minimal diagnoses.

³ Variables in Dom(βj) but not in Dom(α) are all covered in the first two cases.


Notice however that, for both of these goals, the value of mode variables need not be explicit: in other words, the Local Diagnosers may associate with each fault mode a coded Id and use it as a value for mode variables. In this way, information on what has happened inside a subsystem can remain private.

Let us conclude this section with an example of execution of the Supervisor algorithm on the system in Figure 1. The names of the subsystems will be shortened as follows: Eng will denote the Engine; DCP the DC Power; PT the Power Transmission; HES the Hydraulic Extraction System; LG the Landing Gear; LED the LEDs. Moreover, we will write S.x to denote variable x belonging to subsystem S, and S.m will denote a variable summarizing the behaviour mode of components in S.

Let us assume that LG observes a − value for LG.∆E. It autonomously invokes EXTEND, which as an output has only the following assignment:

   | LG.m | LG.∆E
α0 |      | −

The Supervisor receives this assignment, and accordingly initializes its data structures. Figure 4 shows how H changes during the execution of the Supervisor algorithm; in particular, H0 corresponds to H immediately after initialization.⁴

At this stage the only candidate for invocation is LD3, responsible for the HES. Its input consists of assignment α0 restricted to HES variables, which (as already discussed in section 4.1) is extended by LD3 as follows:

   | HES.∆E
α0 | −

=⇒

    | HES.m | HES.∆E | HES.∆V | HES.∆C
α01 | leak  | −      |        |
α02 | stuck | −      |        |
α03 |       | −      | −      |
α04 |       | −      |        | −

The Supervisor receives these results and updates H, obtaining H1 depicted in Figure 4.

Now there are two candidates for invocation, LD2 and LD1, respectively responsible for the DCP and for the PT. They can be invoked in any order; let us assume that LD2 is invoked first. Since it is the first invocation, its input consists of all the assignments in H restricted to DCP variables. For the sake of the example, let us assume that the DCP can produce a − on ∆C only by failing itself, and that as a consequence of this failure ∆X should be − as well. Then, LD2 extends the input assignments in this way:

          | DCP.∆C
α0[1,2,3] |
α04       | −

=⇒

          | DCP.m | DCP.∆C | DCP.∆X
α0[1,2,3] |       |        |
α04       | fail  | −      | −

The Supervisor therefore updates H; the result is the set H2.

Now it's LD1's turn. The input and output of EXTEND are:

          | PT.∆V
α0[1,2,4] |
α03       | −

=⇒

          | PT.m | PT.∆V | PT.∆T
α0[1,2,4] |      |       | 0
α03       | fail | −     | 0

In this case, we assume that the PT has observed that its input ∆T from the engine is normal. As a consequence, it can explain the − on ∆V only with an internal failure, and this is the output it sends to the Supervisor.

⁴ In Figure 4 the symbol indicates that for the corresponding assignment and variable the modified bit is set to 1.

The set of hypotheses is updated accordingly and the Supervisor obtains H3.

The only candidate for invocation is now LD3, because although Eng has some modified bits set to 1, it falls under both Exceptions. Thus LD3 is invoked again, only on the modified assignment, that is α04, restricted to HES variables. Now, let us assume that from the point of view of the HES having ∆X = − necessarily implies that ∆Y = − as well. The result of EXTEND is then:

    | HES.m | HES.∆E | HES.∆V | HES.∆C | HES.∆X | HES.∆Y
α04 |       | −      |        | −      | −      | −

H now becomes H4 in Figure 4. Now LD4 is invoked; let us assume that observations on LED report that everything is ok with the LEDs. What happens is that one of the two input assignments is extended, while the other is rejected as inconsistent. Here are the input and output of EXTEND:

          | LED.∆Y
α0[1,2,3] |
α04       | −

=⇒

          | LED.m | LED.∆Y
α0[1,2,3] |       | 0

Updating H the Supervisor obtains H5.

Notice that due to the Exceptions there is no Local Diagnoser left to invoke, thus the diagnostic process ends here. Minimal diagnoses thus are:

HES.m = leak: the pipe of the Hydraulic Extraction System is leaking (from α01).

HES.m = stuck: the pump of the Hydraulic Extraction System is stuck (from α02).

PT.m = fail: the Power Transmission is broken (from α03).

4.3 Other types of models

We briefly discuss here how the approach can also be applied to models different from the sign-based deviation models used in our explanation.

In general, the algorithm can be applied as is to any type of deviation model where there is only one value associated with the nominal behavior. Models with variable domains such as {ok, ab} or {−−, −, 0, +, ++} belong to this category.

In order to apply the approach to a generic qualitative model, we need to modify the Supervisor algorithm by:

• discarding the Horizontal Exception;

• modifying the Vertical Exception, so that it applies only to input variables.

It is clear that giving up the Horizontal Exception means increasing the number of invocations to Local Diagnosers. What matters, however, is whether these invocations are useful for the Diagnostic Process; in other words, whether the level of abstraction in the model is suitable for the diagnostic task.

For mixed models, that is models that combine deviation variables with other types of variables, it may nevertheless be possible to define other exceptions to the Basic Rule, more restrictive than the ones described here, that reduce the number of invocations. However, each type of model should be analyzed separately to discover whether this optimization is possible.


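H0

   | LG.m | LG.∆E | HES.∆E
α0 |      | −     | −

H1

    | LG.m | LG.∆E | HES.m | HES.∆E | HES.∆V | HES.∆C | DCP.∆C | PT.∆V
α01 |      | −     | leak  | −      |        |        |        |
α02 |      | −     | stuck | −      |        |        |        |
α03 |      | −     |       | −      | −      |        |        | −
α04 |      | −     |       | −      |        | −      | −      |

H2

    | LG.m | LG.∆E | HES.m | HES.∆E | HES.∆V | HES.∆C | HES.∆X | DCP.m | DCP.∆C | DCP.∆X | PT.∆V
α01 |      | −     | leak  | −      |        |        |        |       |        |        |
α02 |      | −     | stuck | −      |        |        |        |       |        |        |
α03 |      | −     |       | −      | −      |        |        |       |        |        | −
α04 |      | −     |       | −      |        | −      | −      | fail  | −      | −      |

H3

    | LG.m | LG.∆E | HES.m | HES.∆E | HES.∆V | HES.∆C | HES.∆X | DCP.m | DCP.∆C | DCP.∆X | PT.m | PT.∆V | PT.∆T | Eng.∆T
α01 |      | −     | leak  | −      |        |        |        |       |        |        |      |       | 0     | 0
α02 |      | −     | stuck | −      |        |        |        |       |        |        |      |       | 0     | 0
α03 |      | −     |       | −      | −      |        |        |       |        |        | fail | −     | 0     | 0
α04 |      | −     |       | −      |        | −      | −      | fail  |        |        |      |       | 0     | 0

H4

    | LG.m | LG.∆E | HES.m | HES.∆E | HES.∆V | HES.∆C | HES.∆X | HES.∆Y | LED.∆Y | DCP.m | DCP.∆C | DCP.∆X | PT.m | PT.∆V | PT.∆T | Eng.∆T
α01 |      | −     | leak  | −      |        |        |        |        |        |       |        |        |      |       | 0     | 0
α02 |      | −     | stuck | −      |        |        |        |        |        |       |        |        |      |       | 0     | 0
α03 |      | −     |       | −      | −      |        |        |        |        |       |        |        | fail | −     | 0     | 0
α04 |      | −     |       | −      |        | −      | −      | −      | −      | fail  |        |        |      |       | 0     | 0

H5

    | LG.m | LG.∆E | HES.m | HES.∆E | HES.∆V | HES.∆C | HES.∆X | HES.∆Y | LED.∆Y | DCP.m | DCP.∆C | DCP.∆X | PT.m | PT.∆V | PT.∆T | Eng.∆T
α01 |      | −     | leak  | −      |        |        |        | 0      | 0      |       |        |        |      |       | 0     | 0
α02 |      | −     | stuck | −      |        |        |        | 0      | 0      |       |        |        |      |       | 0     | 0
α03 |      | −     |       | −      | −      |        |        | 0      | 0      |       |        |        | fail | −     | 0     | 0

Figure 4: The set H in different stages of the Supervisor algorithm, during a diagnosis for the system in Figure 1.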

5 Conclusions

In this paper we discuss a supervised decentralized approach to diagnosis that loosely couples multiple diagnosers, in order to deal with architecturally complex systems.

The goal of the paper is to show that we can effectively perform the diagnostic task and define intelligent strategies for the Supervisor, even if it is not aware of the internal aspects of Local Diagnosers, the Local Diagnosers do not know each other, and their paths of interaction dynamically change.

In this paper we do not address the problem of the implementation of Local Diagnosers. The discussion of efficient algorithms that execute the EXTEND operation is out of the scope of this paper, especially since such algorithms would strongly depend on the modeling language and specific assumptions of each individual Local Diagnoser.

It is worth noting, however, that the decentralized approach can be hierarchically applied to subsystems, possibly having the same program playing both the role of the Supervisor and of the Local Diagnoser. In this case the advantage of the decentralized approach would be, rather than information hiding, to exploit the Supervisor to choose which parts of the model should be analyzed. At the lowest level (i.e., the level of basic components) this would still require directly executing EXTEND; however, for small models the results of this operation could easily be pre-compiled.

Hierarchical scalability is indeed an important feature of our approach, which makes it possible to integrate hierarchical decomposition as a further means to deal with the complexity of the systems to be diagnosed.

As we already mentioned, the approach proposed in this paper is a generalization of the one proposed in [Ardissono et al., 2005] to deal with diagnosis of Web Services. In particular, the definition of the EXTEND interface is borrowed from [Ardissono et al., 2005], while the Supervisor algorithm is generalized in order to loosen the assumptions and still produce all minimal consistency-based diagnoses.

If we consider other approaches in the literature, we see that [Pencole and Cordier, 2005] has some relevant similarity from the point of view of diagnostic architecture: a supervisor is in charge of computing global diagnoses exploiting the work of several local diagnosers. However, this work differs in some significant respects:

• the system to diagnose is modeled as a discrete-event system;

• the main problem is to avoid composing the whole model, because this would produce a state-space explosion;

• as a consequence, the supervisor is aware of the subsystem models, but cannot compose them: it can only compose local diagnoses to obtain a global one;

• due to the nature of the considered systems, reconstructing global diagnoses is a difficult task, and as such it is one of the main focuses of the paper.

Thus [Pencole and Cordier, 2005] actually focuses on quite different theoretical and practical problems than those addressed in this paper.

Other papers in the literature deal with completely distributed approaches to diagnosis, where diagnosers communicate with each other and try to reach an agreement, without a supervisor coordinating them. Our choice of a supervised approach was motivated by the need of having loosely coupled diagnosers, while a purely distributed approach requires diagnosers to establish diagnostic sessions and exchange a greater amount of information in order to compute diagnoses that are consistent with each other.

Nevertheless, the distributed approach proposed in [Roos et al., 2003] has some similarity with ours because it is based on the idea of diagnosers explaining blames received from others, or verifying hypotheses made by others. Needless to say, the proposed algorithms are completely different, having to deal with a distributed approach. Moreover, in order to reduce computational complexity, the authors introduce some fairly restrictive assumptions on the models (e.g. requiring that explanations imply normal observations, or that output values are either completely undetermined or completely determined), which are not acceptable in our case.

Another distributed approach on a similar basis is the one in [Kalech and Kaminka, 2004]; however, in this case the focus is on diagnosing failures in the communication among agents in a team, a problem the authors refer to as social diagnosis. Thus the tackled problem is different from the one we cope with, since it deals with failures arising in the communication between subsystems with different sets of beliefs, rather than with failures happening inside a subsystem and propagating to others.

Finally, an interesting distributed approach is presented in [Provan, 2002]. Here the scenario is rather similar to ours: each diagnoser has a local model and observations, which are not shared with the others. Each local diagnoser computes local minimal diagnoses, and then engages in a conversation to reach a globally sound diagnosis. However, since the approach is purely distributed, the solutions are different. In particular, in order to propose a solution with reduced complexity, the author focuses on systems whose connection graph has a tree-like structure. On the contrary, our approach (thanks to the presence of the supervisor) does not need to make any assumption on system structure; in fact, system structure may even dynamically change.

References

[Ardissono et al., 2005] L. Ardissono, L. Console, A. Goy, G. Petrone, C. Picardi, M. Segnan, and D. Theseider Dupre. Cooperative model-based diagnosis of web services. In Proceedings of the Sixteenth International Workshop on the Principles of Diagnosis (DX 05), Pacific Grove, California, 2005.

[Genesereth, 1984] M.R. Genesereth. The use of design descriptions in automated diagnosis. Artificial Intelligence, 24(1-3):411–436, 1984.

[Kalech and Kaminka, 2004] M. Kalech and G.A. Kaminka. Diagnosing a team of agents: scaling-up. In Proc. 15th Int. Work. on Principles of Diagnosis (DX-04), pages 129–134, 2004.

[Malik and Struss, 1996] A. Malik and P. Struss. Diagnosis of dynamic systems does not necessarily require simulation. In Proc. 7th Int. Work. on Principles of Diagnosis, pages 147–156, 1996.

[Mozetic, 1991] I. Mozetic. Hierarchical model-based diagnosis. Int. J. of Man-Machine Studies, 35(3):329–362, 1991.

[Pencole and Cordier, 2005] Y. Pencole and M.-O. Cordier. A formal framework for the decentralised diagnosis of large scale discrete event systems and its application to telecommunication networks. Artificial Intelligence, 164(1-2), 2005.

[Picardi et al., 2002] C. Picardi, R. Bray, F. Cascio, L. Console, P. Dague, O. Dressler, D. Millet, B. Rehfus, P. Struss, and C. Vallee. IDD: Integrating Diagnosis in the Design of automotive systems. In Proceedings of the 15th European Conference on Artificial Intelligence (ECAI2002), pages 628–632, 2002.

[Picardi et al., 2004] C. Picardi, L. Console, F. Berger, J. Breeman, T. Kanakis, J. Moelands, E. Arbaretier, S. Collas, N. De Domenico, E. Girardelli, O. Dressler, P. Struss, and B. Zilbermen. AUTAS: a tool for supporting FMECA generation in aeronautic systems. In Proceedings of the 16th European Conference on Artificial Intelligence (ECAI2004), 2004.

[Provan, 2002] G. Provan. A model-based diagnostic framework for distributed systems. In Proc. 13th Int. Work. on Principles of Diagnosis (DX-02), pages 16–22, 2002.

[Roos et al., 2003] N. Roos, A. ten Teije, and C. Witteveen. A protocol for multi-agent diagnosis with spatially distributed knowledge. In 2nd Int. Joint Conference on Autonomous Agents and Multi-Agent Systems (AAMAS-2003), Melbourne, Australia, July 2003.

[Sachenbacher et al., 2000] M. Sachenbacher, P. Struss, and R. Weber. Advances in design and implementation of obd functions for diesel injection systems based on a qualitative approach to diagnosis. In SAE 2000 World Congress, 2000.


Comparing diagnosability in Continuous and Discrete-Event Systems

Marie-Odile Cordier
INRIA
Rennes, France

Louise Trave-Massuyes and Xavier Pucel
LAAS-CNRS
Toulouse, France

Abstract

This paper is concerned with diagnosability analysis, which proves a requisite for several tasks during the system's life cycle. The Model-Based Diagnosis (MBD) community has developed specific approaches for Continuous Systems (CS) and for Discrete Event Systems (DES) in two distinct and parallel tracks. In this paper, the correspondences between the concepts used in CS and DES approaches are clarified and it is shown that the diagnosability problem can be brought back to the same formulation using the concept of signatures. These results bridge CS and DES diagnosability and open perspectives for hybrid model-based diagnosis.

1 Introduction

Diagnosis is an increasingly active research domain, which can be approached from different perspectives according to the type of system at hand and the required abstraction level. Although some recent works have considered diagnosis based on hybrid models [Williams and Nayak, 1996; Benazera et al., 2002; Benazera and Trave-Massuyes, 2003; Gupta et al., 2004], the Model-Based Diagnosis (MBD) community has developed specific approaches for Continuous Systems (CS) and for Discrete Event Systems (DES) in two distinct and parallel tracks. Algorithms for monitoring, diagnosis and diagnosability analysis have been proposed [Sampath et al., 1995; Jiang et al., 2001; Yoo and Lafortune, 2002; Cimatti et al., 2003; Roze and Cordier, 2002; Jeron et al., 2006; Patton et al., 1989; Staroswiecki and Comtet-Varga, 1999; Frisk et al., 2003; Struss and Dressler, 2003]. The formalisms and tools are quite different: the CS community makes use of algebro-differential equation models or qualitative abstractions whereas the DES community uses finite-state formalisms. For diagnosability analysis, the CS approaches generally adopt a state-based diagnosis point of view in the sense that diagnosis is performed on a snapshot of observables, i.e. one observation at a given time point. The DES approaches perform event-based diagnosis and achieve state tracking, which means dynamic diagnosis reasoning achieved across time.

This paper is concerned with diagnosability analysis, which proves a requisite for several tasks during the system's life cycle, in particular instrumentation design, end-of-line testing, testing for diagnosis, etc. In spite of quite different frameworks, it is shown that the diagnosability assessment problem stated on both sides can be brought back to the same formulation and that common concepts can be proposed for proving diagnosability definitions equivalent. This result provides solid ground for considering the analysis of hybrid systems diagnosability.

2 DES and CS modelling approaches

This section presents the different theories used to model DESs and CSs. The principles underlying DES and CS model-based diagnosis are given and diagnosability is introduced on both sides. Both approaches rely on the analysis of the observable consequences of faults, i.e. symptoms.

The main difference between DES and CS diagnosability analysis processes is that the order of appearance of the symptoms is only taken into account in the DES approach. In the CS approach, fault occurrence assumes immediate and simultaneous observation of the symptoms, while in the DES approach diagnosis relies on the observation of a sequence of symptoms after fault occurrence. Proof is given that, assuming the system is observed for a sufficiently long time, diagnosability conditions for DES and CS are conceptually equivalent.

2.1 The models

DES model

A DES is modelled by a language Lsys ⊆ E∗ where E is the set of system events. Lsys is prefix-closed, and can be described by a regular expression, or generated by a finite state automaton G = (Q, E, T, q0) where Q is the set of states, E the set of events, T ⊆ (Q × E × Q) the transition relation and q0 the initial state. Each trajectory in the automaton corresponds to one word of the language, and represents a sequence of events that may occur in the system. The set of events E is partitioned into observable and unobservable events: E = Eo ∪ Euo, and a set of faults Ef ⊆ Euo is given. The diagnosis process aims at detecting and assessing the occurrence of unobservable fault events from a sequence of observed events. The set OBS is defined as the set of all the possible observable event sequences, i.e., OBS = {(e1 e2 . . . en)} where n is any positive integer.


In this article, it is assumed that the automaton is deterministic (T : Q × E → Q is a partial function), generates a live language (every state has at least one outgoing transition), and contains no cycle of unobservable events.

The diagnosis process makes use of a projection operation that removes all unobservable events from a trajectory. The inverse operation is applied to a set of observable event sequences and leads to the diagnoses. A fault is diagnosable when its occurrence is always followed by a bounded observable event sequence that cannot be generated in its absence (see definition 1).

CS model

The behavior model of a CS Σ = (R, V) is generally described by a set of n relations R, which relate a set of m variables V. In a component-oriented model, these relations are associated with the physical components of the system, including the sensors. The set R is partitioned into behavioral relations, which correspond to the internal components, and observation relations, which correspond to the sensors. The set of variables V is also partitioned into the set of observed variables O, whose corresponding value tuples are called observations, and the set of unobserved variables noted X.

Observation values, possibly processed into fault indicators, provide a means to characterize the system at a given time. In a pure consistency-based approach, in which only the normal behavior of the system is modelled, the designer may use the model to establish a set of Analytical Redundancy Relations, which can be expressed as a set of residuals. In that case, the observations result in a boolean fault indicator tuple. In the following, we will refer without loss of generality to the observation tuples and define the set OBS as the set of all the possible observation tuples, i.e., OBS = {(o1, o2, ..., ok)} where k is the number of sensors. The observation value pattern is referred to as the observed signature whereas the expected value patterns for a given fault, obtained from the behavioral model, provide the fault signature. Note that several value patterns may correspond to the same fault, for example when the system undergoes several operating modes. The fault signature is hence defined as the set of all possible observable variable value tuples under the fault. The diagnosis process relies on comparing the observed signature with fault signatures. Fault signatures also allow one to test fault detectability.

2.2 The set of observables

In the case of DES, observations consist of a sequence of observable events, while in the case of CS, observations consist of a set of values for observable variables, with no ordering.

This paper focuses on comparing the notions based on observations that lead to diagnosability, abstracting from the nature of the observations. It is shown that the concept of signatures can be defined in a way that allows the equivalence of definitions to be proved. However, this does not imply that any system that is diagnosable when modelled as a DES is diagnosable as a CS, due to the difference in the nature of the observations. The set of observables OBS is defined as the set containing all the observations that are possible for the system. It may represent the observations obtained from a DES (a set of ordered observable events) as well as those from a CS (a set of observable values).

3 Faults, diagnoses and fault signatures

This section contains formal definitions of faults, diagnoses, and fault signatures. The definitions of diagnosability rely on these (see next section).

3.1 Faults and diagnoses

The set of faults Fsys associated with a system is partitioned into n types of faults; the partition is noted F. The following properties hold:

- $\forall F_i, F_j \in F,\ F_i \cap F_j \neq \emptyset \Rightarrow i = j$
- $\bigcup_{i=0}^{n} F_i = F_{sys}$

The occurrence of one or several faults of one type is called a single fault. When faults of several types have occurred, the system is said to be under a multiple fault. The set of possible faults that may occur in a system is the power set of F, noted P(F). For example, ∅ describes the absence of faults, {Fi} a single fault, and {Fi, Fj} a multiple fault. All three examples are elements of P(F). Faults are assumed to be permanent.

A diagnosis consists of a set of fault candidates. When a diagnosis contains only one fault, it is said to be determinate, while if it contains several faults it is indeterminate. The set of all possible diagnoses is the power set of the set of faults, noted P(P(F)). For example, {∅}, {{Fi}} and {{Fi, Fj}} are determinate diagnoses, while {∅, {Fi}, {Fj, Fk}} is an indeterminate diagnosis indicating that one of the three diagnosis candidates ∅, {Fi} and {Fj, Fk} has occurred.
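The distinction between faults (elements of P(F)) and diagnoses (elements of P(P(F))) can be made concrete with a minimal Python sketch. The fault type names `Fi`, `Fj`, `Fk` and the helper names `power_set` and `is_determinate` are illustrative assumptions, not part of the paper.

```python
from itertools import chain, combinations

def power_set(items):
    """Return all subsets of `items` as a set of frozensets."""
    items = list(items)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}

# Hypothetical fault types, used only for illustration.
FAULT_TYPES = {"Fi", "Fj", "Fk"}
FAULTS = power_set(FAULT_TYPES)  # P(F): every single or multiple fault

def is_determinate(diagnosis):
    """A diagnosis is a set of fault candidates; it is determinate
    when it contains exactly one candidate."""
    return len(diagnosis) == 1

no_fault = frozenset()               # absence of faults
single = frozenset({"Fi"})           # single fault
multiple = frozenset({"Fj", "Fk"})   # multiple fault

determinate = {single}
indeterminate = {no_fault, single, multiple}
```

Each candidate is a `frozenset` so that it can itself be an element of a diagnosis, mirroring the nesting of P(P(F)).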

3.2 Fault signatures

Establishing fault signatures is the main part of our diagnosability analysis process. This concept is commonly used in the CS approach, but less so in DES. The CS notion of fault signature is generalized and extended to DESs, allowing one to write diagnosability criteria in a unified way.

In general, one can consider a fault signature as a function Sig associating a set of observables with each fault:

Sig : P(F) → P(OBS)

Continuous systems

The fault signature is a classical concept in the CS approach, usually defined as follows. For a fault f of P(F), let OBSf be the set of all possible tuples consisting of observed variable values under the fault f, regardless of time¹. Then:

Sig(f) = OBSf ∈ P(OBS)

Discrete event systems

Fault signatures are based upon the projection over observable events, which is defined in a first step. They correspond to what is usually known as observable trajectories in the DES community.

¹ Note that "under the fault f" means that exactly all the faults in f occurred, and no faults outside f occurred.

Language projection. The language projection over the set of observable events Eo, noted Pobs, associates with a language L the language formed by the words of L restricted to the letters that are elements of Eo. For example, if L = {e1, e1e3, e1e2, e2e3, e1e2e3} and Eo = {e1, e2}, then Pobs(L) = {e1, e1e2, e2}. The inverse projection P^{-1}_obs, defined on P(OBS), associates with a set of observable event sequences the set of trajectories (which is a language) whose projections belong to the antecedent set:

$\forall O \in \mathcal{P}(OBS),\quad P_{obs}^{-1}(O) = \{\, s \in L_{sys},\ P_{obs}(s) \cap O \neq \emptyset \,\}$
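The projection and its inverse can be illustrated with a short Python sketch that reproduces the example above. Words are encoded as tuples of event names and the system language is taken to be a finite set; both choices, and the function names, are assumptions made for illustration (a real Lsys is in general infinite).

```python
def project(word, observable):
    """P_obs on one word: keep observable events, preserving their order."""
    return tuple(e for e in word if e in observable)

def project_language(language, observable):
    """P_obs on a language: the set of projected words."""
    return {project(w, observable) for w in language}

def inverse_projection(observations, system_language, observable):
    """Trajectories of the system whose projection belongs to `observations`."""
    return {
        w for w in system_language
        if project(w, observable) in observations
    }

# The example language from the text.
L = {("e1",), ("e1", "e3"), ("e1", "e2"), ("e2", "e3"), ("e1", "e2", "e3")}
Eo = {"e1", "e2"}

projected = project_language(L, Eo)
explanations = inverse_projection({("e1",)}, L, Eo)
```

Here `explanations` contains the two words of L that are observed as e1, which is how the inverse projection turns an observation into a set of candidate trajectories.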

Fault language. For each fault f ∈ P(F), the f-language, or Lf, describes all possible trajectories (scenarios) in which f occurs. Lf is defined as the subset of the language Lsys of the system automaton restricted to the words containing at least one occurrence of every single fault event composing f, and no occurrence of any other fault event. The words of the f-language are called f-trajectories.
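As a rough illustration of this definition, the sketch below filters a finite sample of system words. It assumes, for simplicity, that each fault type consists of a single fault event; the event names and the sample language `L_SYS` are hypothetical.

```python
def fault_language(system_words, fault, fault_events):
    """Words containing every fault event of `fault` and no other fault event."""
    fault = set(fault)
    other_faults = set(fault_events) - fault
    return {
        w for w in system_words
        if fault <= set(w) and other_faults.isdisjoint(w)
    }

# Hypothetical fault events and a finite sample of system trajectories.
FAULT_EVENTS = {"f1", "f2"}
L_SYS = {
    ("a",),
    ("a", "f1"),
    ("a", "f1", "b"),
    ("f2", "a"),
    ("f1", "f2", "b"),
    ("f1", "a", "f1"),
}

nominal = fault_language(L_SYS, set(), FAULT_EVENTS)
only_f1 = fault_language(L_SYS, {"f1"}, FAULT_EVENTS)
both = fault_language(L_SYS, {"f1", "f2"}, FAULT_EVENTS)
```

By construction the f-languages of distinct faults are disjoint, so they partition the sampled trajectories by the exact set of faults that occurred.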

Fault signature. Because of our particular interest in diagnosability, among the set of f-trajectories, we pay special attention to those that can be obtained when the observation temporal window can be arbitrarily extended. This is done by considering, in Lf, only words that end in an infinite cycle. They are defined as the maximal words, and form the maximal f-language L^max_f of the fault. Formally, a trajectory s of Lf belongs to L^max_f if and only if ∃t, u ∈ E∗, s = tu^∞. The notation u^∞ refers to the word built as an infinite concatenation of the word u, i.e., every u^n ∈ u∗ is a prefix of u^∞.

For each fault f ∈ P(F), the projection of the maximal f-language L^max_f over the set of observable events is called the f-signature. Any f-signature is a subset of OBS as it is solely composed of observable events. With the above definitions, it is possible to define the signature function Sig as the function associating its f-signature with any fault f ∈ P(F):

∀f ∈ P(F), Sig(f) = f-signature ∈ P(OBS)

4 Diagnosability

Formal definitions of diagnosability according to the DES and CS approaches are now given.

4.1 Discrete Event Systems

We rely here on the (strong)² diagnosability definition given by [Sampath et al., 1995].

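DES (strong) Diagnosability: a DES is (strongly) diagnosable if and only if³:

$\forall F_i \in F,\ \exists n_i \in \mathbb{N},\ \forall s \in L_{sys}/(F_i \in s),\ \forall t \in E^{*}/(st \in L_{sys}),$
$\|t\| \geq n_i \Rightarrow \forall u \in P_{obs}^{-1}\big(P_{obs}(st)\big),\ F_i \in u \qquad (1)$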

One can notice that the definition is stated with respect to elements of F. The system is required to be diagnosable for each fault type, independently of whether the faults are single or multiple.

² A definition for weak diagnosability is given in [Roze and Cordier, 2002] for DES and in [Trave-Massuyes et al., 2004] for CS.

³ The notation Fi ∈ s means that s contains at least one fault event of Fi.

4.2 Continuous systems

In the CS approach, the classical definition of diagnosability is already given in terms of the fault signature concept as follows [Trave-Massuyes et al., 2004].

CS (Strong) Diagnosability: a CS is (strongly) diagnosable if and only if:

$\forall f_1, f_2 \in \mathcal{P}(F),\ f_1 \neq f_2,\ Sig(f_1) \cap Sig(f_2) = \emptyset \qquad (2)$

This definition applies to single or multiple faults and differs from the DES definition in this respect. It is shown in the next section that this difference is not relevant and that the fault signature concept is a unifying concept allowing one to formally compare the two approaches.

5 Formal Comparison

In this section, we give the proof of equivalence between the diagnosability definitions in the DES and CS approaches. We first prove that the DES definition can be extended to multiple faults, which provides better insight into the interpretation of the definition.

As noted before, definition (1) is stated for elements of F, which corresponds to considering single faults. Let us extend it to multiple faults. The occurrence of a multiple fault f in a trajectory s is noted ∀Fi ∈ f, Fi ∈ s. The diagnosability condition (1) is verified for each Fi ∈ f with possibly different ni values. Taking the largest of these ni values as nf, it can easily be shown that definition (1) is equivalent to definition (1′), which accounts explicitly for multiple faults f = {Fi}.

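$\forall f \in \mathcal{P}(F),\ \exists n_f \in \mathbb{N},\ \forall s \in L_{sys}/\big(\forall F_i \in f,\ F_i \in s\big),\ \forall t \in E^{*}/(st \in L_{sys}),$
$\|t\| \geq n_f \Rightarrow \forall u \in P_{obs}^{-1}\big(P_{obs}(st)\big),\ \forall F_i \in f,\ F_i \in u \qquad (1')$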

This result shows that the DES diagnosability definition can be given in terms of faults (instead of fault types), whether single or multiple, like the CS diagnosability definition.

The equivalence between the diagnosability definitions is now proved by considering how the absence of other faults is assessed in a diagnosable discrete event system.

Let us consider a diagnosable system, thus verifying (1), and trajectories of arbitrary length, in particular maximal trajectories which correspond to maximal words as defined in section 3.2. Let us consider such a maximal trajectory s belonging to the f-language Lf. It means that s contains at least one occurrence of every single fault event composing f and no occurrence of any other fault. Thus s belongs to L^max_f and its projection over the set of observable events belongs to the f-signature. Now suppose that there exists a (maximal) trajectory u such that Pobs(u) equals Pobs(s) and that u contains at least one occurrence of a fault Fj which does not belong to f. By (1), it implies that all trajectories sharing the observable projection of u contain Fj, which contradicts our hypothesis about s. Thus, there does not exist any trajectory having the same observable projection as s and containing a fault not belonging to f. This proves that ∀f1, f2 ∈ P(F), f1 ≠ f2, Sig(f1) ∩ Sig(f2) = ∅, which is exactly the definition (2) given in 4.2 for Continuous Systems.


6 Operational comparison

This section contains an example that illustrates the concepts introduced before and compares the DES and CS approaches in an operational way. Bridges between state variables in the CS view and events in the DES view are provided and diagnosability analysis is performed along the state-based diagnosis and the dynamic diagnosis approaches.

6.1 Example

[Figure 1: A water flow system. The figure shows Tank 1 and Tank 2 with heights y1 and y2, consumers c1 and c2, the pump, and the delays τ1 and τ2.]

The system represented in Figure 1 is inspired by [Puig et al., 2005]. It is composed of two water tanks with heights y1 and y2, and a pump, connected by a water flow channel. The tanks supply consumers c1 and c2. The delay τ1 corresponds to the time needed for the water to reach tank 2 from tank 1, and the delay τ2 to the time needed to reach tank 1 from the pump. The system has two operating modes: pump on and pump off. We consider faults in sensors y1, y2, c1 and c2, named respectively Fy1, Fy2, Fc1 and Fc2.

The example is limited to single faults and, in order to simplify the models of the system, it is assumed that the system does not switch its operating mode between the occurrence of a fault and the appearance of its symptoms.

6.2 Continuous model, state-based diagnosis

The discretized and linearized non-linear dynamic equations are:

$y_1(t+\Delta t) = y_1(t) - k_1 c_1(t) + k_2 u_{pump}(t-\tau_2) - k_3 u_{out}(t)$

$u_{out}(t) = k\sqrt{y_1(t)} \cong k_4 y_1(t)$

$u_{pump} = k\,[a(h-y_2)^2 + b(h-y_2) + c] \cong k_5 + k_6 y_2(t)$

$y_2(t+\Delta t) = y_2(t) - k_7 c_2(t) + k_8 u_{out}(t-\tau_1) - k_9 u_{pump}(t)$

Here ∆t is the sampling time. Since upump is the flow through the pump, upump(t) = 0 when the pump is off, which can be achieved by choosing k5 = k6 = 0.

From these equations, it is possible to predict the values of y1 and y2, noted ŷ1 and ŷ2, with:

$\hat{y}_1(t+\Delta t) = (1 - k_3 k_4)\, y_1(t) - k_1 c_1(t) + k_2 k_6\, y_2(t-\tau_2) + k_2 k_5$

$\hat{y}_2(t+\Delta t) = (1 - k_9 k_6)\, y_2(t) - k_7 c_2(t) + k_8 k_4\, y_1(t-\tau_1) - k_9 k_5$

From the equations above, two consistency tests can be obtained in the form of analytical redundancy relations:

$r_1(t+\Delta t) = y_1(t+\Delta t) - \hat{y}_1(t+\Delta t) = y_1(t+\Delta t) - \big[(1 - k_3 k_4)\, y_1(t) - k_1 c_1(t) + k_2 k_6\, y_2(t-\tau_2) + k_2 k_5\big]$

$r_2(t+\Delta t) = y_2(t+\Delta t) - \hat{y}_2(t+\Delta t) = y_2(t+\Delta t) - \big[(1 - k_9 k_6)\, y_2(t) - k_7 c_2(t) + k_8 k_4\, y_1(t-\tau_1) - k_9 k_5\big]$
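The two residuals can be evaluated numerically with a small Python sketch. The coefficient values in `K`, the measurement values, the detection threshold and the function names are all assumptions chosen for illustration; the paper gives no numerical values.

```python
# Hypothetical coefficient values k1..k9; the paper gives no numbers.
K = dict(k1=0.1, k2=0.2, k3=0.3, k4=0.5, k5=1.0, k6=0.4, k7=0.1, k8=0.2, k9=0.3)

def predict(m, k, pump_on):
    """One-step predictions (y1_hat, y2_hat) from measurements at time t.

    m holds y1, y2, c1, c2 at time t, plus the delayed readings
    y1_d = y1(t - tau1) and y2_d = y2(t - tau2).
    """
    # k5 and k6 are null when the pump is off.
    k5, k6 = (k["k5"], k["k6"]) if pump_on else (0.0, 0.0)
    y1_hat = ((1 - k["k3"] * k["k4"]) * m["y1"] - k["k1"] * m["c1"]
              + k["k2"] * k6 * m["y2_d"] + k["k2"] * k5)
    y2_hat = ((1 - k["k9"] * k6) * m["y2"] - k["k7"] * m["c2"]
              + k["k8"] * k["k4"] * m["y1_d"] - k["k9"] * k5)
    return y1_hat, y2_hat

def residuals(next_y, m, k, pump_on):
    """r1, r2: measured next heights minus their predictions."""
    y1_hat, y2_hat = predict(m, k, pump_on)
    return next_y[0] - y1_hat, next_y[1] - y2_hat

def symptoms(res, threshold=1e-6):
    """Boolean fault indicator tuple: 1 when a residual is violated."""
    return tuple(int(abs(r) > threshold) for r in res)

truth = dict(y1=2.0, y2=1.5, c1=0.3, c2=0.2, y1_d=2.1, y2_d=1.4)
next_y = predict(truth, K, pump_on=True)     # fault-free plant follows the model
faulty = dict(truth, c1=truth["c1"] + 0.5)   # biased reading on sensor c1

healthy_pattern = symptoms(residuals(next_y, truth, K, pump_on=True))
fc1_pattern = symptoms(residuals(next_y, faulty, K, pump_on=True))
```

Under these assumptions a bias on the c1 reading violates r1 only, since c1 does not appear in the expression of r2.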

Using these analytical redundancy relations and considering that k5 and k6 are null when the pump is off, we deduce the fault signature matrices shown in Figure 2.

The fault signature matrices indicate that the system is not diagnosable since, for example, the observable (pon, r1 = 1, r2 = 1) belongs to two fault signatures.

Pump on mode
      Fy1  Fy2  Fc1  Fc2
r1     1    1    1    0
r2     1    1    0    1

Pump off mode
      Fy1  Fy2  Fc1  Fc2
r1     1    0    1    0
r2     1    1    0    1

Figure 2: Fault signature matrices for the system
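The non-diagnosability stated above can be checked mechanically. The following Python sketch encodes the two matrices of Figure 2, builds each fault signature as a set of observables (mode, r1, r2), and lists the observables shared by several faults. The data layout and function names are illustrative assumptions.

```python
# Columns of Figure 2: fault -> (r1, r2), for each operating mode.
SIGNATURE_MATRICES = {
    "pon":  {"Fy1": (1, 1), "Fy2": (1, 1), "Fc1": (1, 0), "Fc2": (0, 1)},
    "poff": {"Fy1": (1, 1), "Fy2": (0, 1), "Fc1": (1, 0), "Fc2": (0, 1)},
}

def fault_signatures(matrices):
    """Sig(f): set of observables (mode, r1, r2) possible under fault f."""
    sig = {}
    for mode, columns in matrices.items():
        for fault, (r1, r2) in columns.items():
            sig.setdefault(fault, set()).add((mode, r1, r2))
    return sig

def shared_observables(sig):
    """Observables that belong to the signature of more than one fault."""
    owners = {}
    for fault, observables in sig.items():
        for obs in observables:
            owners.setdefault(obs, set()).add(fault)
    return {obs: faults for obs, faults in owners.items() if len(faults) > 1}

SIG = fault_signatures(SIGNATURE_MATRICES)
CONFLICTS = shared_observables(SIG)
IS_DIAGNOSABLE = not CONFLICTS   # definition (2): signatures pairwise disjoint
```

`CONFLICTS` then holds exactly two entries: the observable (pon, 1, 1) is shared by Fy1 and Fy2, and (poff, 0, 1) by Fy2 and Fc2.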

6.3 Discrete event model, dynamic diagnosis

For the DES model of the system, the following events are used: pon and poff, fired when the pump is turned on or off; FS, fired when a fault occurs on sensor S; r1 and r2, fired when the analytical redundancy relations r1 and r2 are violated.

The automaton is shown in Figure 3. An arc labelled a.b represents two arcs labelled a and b, a leading to a state in which only b may occur.

[Figure 3: Automaton describing the system. Arc labels appearing in the figure: pon, poff; Fy1.r1.r2, Fy2.r2.r1, Fc1.r1, Fc2.r2; Fy1.r1.r2, Fc1.r1, Fc2.r2, Fy2.r2.pon.r1.]


Fault   Signature
∅       (pon.poff)∞
Fc1     (pon.poff)∗.r1.(pon.poff)∞
        (pon.poff)∗.pon.r1.(poff.pon)∞
Fc2     (pon.poff)∗.r2.(pon.poff)∞
        (pon.poff)∗.pon.r2.(poff.pon)∞
Fy1     (pon.poff)∗.r1.r2.(pon.poff)∞
        (pon.poff)∗.pon.r1.r2.(poff.pon)∞
Fy2     (pon.poff)∗.r2.pon.r1.(poff.pon)∞
        (pon.poff)∗.pon.r2.r1.(poff.pon)∞

Figure 4: Fault signatures (discriminant subwords are bolded in the original figure).

From the automaton and following section 3.2, it is possible to build the signatures for all the faults (see Figure 4). Recall that all the events except faults are observable. The fault signatures are disjoint sets; the system is hence diagnosable.
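The disjointness of the signatures of Figure 4 can be sanity-checked with a bounded Python sketch. Infinite words are truncated to a few cycles, so this is an illustration rather than a proof. The encoding (`MIDDLES`, the bounds `max_cycles` and `tail_cycles`, the assumption that the pump is initially off) is hypothetical. The check relies on one sound argument: the tails contain no symptom, so two words whose subsequences of symptoms r1, r2 differ cannot be equal.

```python
from itertools import combinations

SYMPTOMS = {"r1", "r2"}

# Observable events emitted after the fault, per pump mode when it occurs
# (read from Figure 4; "none" stands for the absence of faults).
MIDDLES = {
    "none": {"off": (), "on": ()},
    "Fc1": {"off": ("r1",), "on": ("r1",)},
    "Fc2": {"off": ("r2",), "on": ("r2",)},
    "Fy1": {"off": ("r1", "r2"), "on": ("r1", "r2")},
    "Fy2": {"off": ("r2", "pon", "r1"), "on": ("r2", "r1")},
}

def mode_after(word):
    """Pump mode reached after `word`, assuming the pump starts off."""
    mode = "off"
    for event in word:
        if event == "pon":
            mode = "on"
        elif event == "poff":
            mode = "off"
    return mode

def truncated_signature(middles, max_cycles=3, tail_cycles=2):
    """Finite truncations of the signature words of one fault."""
    words = set()
    for k in range(max_cycles + 1):
        for start, middle in middles.items():
            head = ("pon", "poff") * k + (("pon",) if start == "on" else ())
            body = head + middle
            # The infinite tail keeps toggling the pump from its current mode.
            tail = ("pon", "poff") if mode_after(body) == "off" else ("poff", "pon")
            words.add(body + tail * tail_cycles)
    return words

def symptom_view(words):
    """Subsequence of symptom events of each word."""
    return {tuple(e for e in w if e in SYMPTOMS) for w in words}

signatures = {fault: truncated_signature(m) for fault, m in MIDDLES.items()}
views = {fault: symptom_view(words) for fault, words in signatures.items()}
diagnosable = all(views[a].isdisjoint(views[b]) for a, b in combinations(views, 2))
```

In this encoding Fy1 and Fy2 are told apart only by the order of r1 and r2, which is precisely the information discarded by the state-based view of section 6.2.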

6.4 Results

This example shows that, although DES and CS diagnosability definitions are formally equivalent, operational diagnosability assessment critically depends on the nature of observables.

In the CS approach, diagnosability is not achieved, as fault signatures are not disjoint: (pon, r1 = 1, r2 = 1) is a signature for both Fy1 and Fy2, and (poff, r1 = 0, r2 = 1) is a signature for both Fy2 and Fc2.

In the DES model, in the pump on mode, the symptoms r1 = 1 and r2 = 1 appear in the order (r1r2) for Fy1 and in reverse order (r2r1) for Fy2. Taking this order into account permits fault discrimination between Fy1 and Fy2 in dynamic diagnosis. In addition, in the pump off mode, both Fy2 and Fc2 are followed by the r2 symptom, but only in the case of Fy2 will a pon command be followed by the r1 symptom. Notice that diagnosability rests on the assumption that the pump will be turned on at some time: it is only after the pon command that the faults can be discriminated.

7 Related work

In the context of continuous systems, diagnosability analysis is stated in terms of detectability and isolability [Chen and Patton, 1994]. [Basseville, 2001] reviews several definitions of fault detectability and isolability and distinguishes two types of definitions, namely intrinsic definitions that do not make any reference to a particular residual generator and performance-based definitions. In [Staroswiecki and Comtet-Varga, 1999], the conditions for sensor, actuator and component fault detectability are given for algebraic dynamic systems and isolability is discussed. Diagnosability analysis for continuous systems is often focused on finding the optimal sensor placement, as in [Trave-Massuyes et al., 2001], which uses a structural approach, or [Yan, 2004] and [Tanaka, 1989]. [Frisk et al., 2003] also follow a structural approach and show how different levels of knowledge about the faults may influence the fault isolability properties of the system. In [Trave-Massuyes et al., 2004], a definition for diagnosability in terms of fault signatures is proposed and is the one used in this paper. In [Struss and Dressler, 2003], the state-based approach is extended to take into account several operating modes, for which state signatures may be different. In this situation strong diagnosability is hardly achieved and the paper proposes a definition to distinguish different discriminability situations. Two faults may be not discriminable, necessarily discriminable or possibly discriminable depending on the intersection pattern of their associated observation sets. This work is strongly related to the weak diagnosability definition provided in [Trave-Massuyes et al., 2004] for CS and [Roze and Cordier, 2002] for DES. Comparing the formal definitions of weak diagnosability still remains to be done.

In the DES context, the first definitions were proposed in [Sampath et al., 1995]. Checking diagnosability is computationally complex and polynomial time algorithms have been designed to cope with this problem [Jiang et al., 2001; Yoo and Lafortune, 2002]. In [Cimatti et al., 2003], formal verification of diagnosability is based on model-checking techniques. More recently, [Jeron et al., 2006] propose a generalization of diagnosability properties to supervision patterns (describing various patterns involving fault events).

To our knowledge, there is no existing work comparing and/or unifying diagnosability approaches coming from the CS and DES communities. Some diagnosis algorithms have been proposed for hybrid systems but diagnosability conditions have not been exhibited for such systems and this is one of our goals for future work. This paper is a direct continuation of the work done with the Imalaia group and devoted to bridging the gap between the two communities [Cordier et al., 2004] by comparing their respective approaches to model-based diagnosis.

8 Conclusion

In this paper, we propose a formal framework to compare the diagnosability definitions from the CS and DES communities in an adequate way. The signature concept is generalized to trajectories and allows us to prove the equivalence of the diagnosability definitions. The key issue is the way observations are defined: in a static way in the CS approach and as partially ordered sets (sequences) in the DES approach. On the one hand, when temporal information is necessary to discriminate faults, the DES approach gives better results. On the other hand, it requires waiting a certain amount of time before getting the result. In practical applications, this delay has to be estimated and must be realistic with respect to existing risks and the decisions to be taken. Another view is to enrich CS signatures with temporal information [Puig et al., 2005].

Having a common diagnosability analysis approach for both state-based and dynamic diagnosis opens interesting perspectives for analysing hybrid systems diagnosability. Some results along this line can be found in [Bayoudh et al., 2006].

Future work will address the extension of the comparison of DES and CS approaches to weak diagnosability definitions (as given in [Trave-Massuyes et al., 2004] for CS and in [Roze and Cordier, 2002] for DES). This is an important issue because real world systems are generally weakly but not strongly diagnosable. Hence weak diagnosability is more relevant than strong diagnosability from a practical point of view.

References

[Basseville, 2001] M. Basseville. On fault detectability and isolability. European Journal of Control, 7(8):625–637, 2001.

[Bayoudh et al., 2006] M. Bayoudh, L. Trave-Massuyes, and X. Olive. Hybrid systems diagnosability by abstracting faulty continuous dynamics. In Proceedings of DX'06, 2006.

[Benazera and Trave-Massuyes, 2003] E. Benazera and L. Trave-Massuyes. The consistency approach to the on-line prediction of hybrid system configurations. IFAC Conference on Analysis and Design of Hybrid Systems (ADHS'03), Saint-Malo (France), 2003.

[Benazera et al., 2002] E. Benazera, L. Trave-Massuyes, and P. Dague. State tracking of uncertain hybrid concurrent systems. In Proceedings of the International Workshop on Principles of Diagnosis (DX'02), pages 106–114, 2002.

[Chen and Patton, 1994] J. Chen and R.J. Patton. A re-examination of fault detectability and isolability in linear dynamic systems. In Proceedings of the 2nd Safeprocess Symposium, Helsinki (Finland), pages 567–573, 1994.

[Cimatti et al., 2003] A. Cimatti, C. Pecheur, and R. Cavada. Formal verification of diagnosability via symbolic model checking. Proceedings of IJCAI'03, pages 363–369, 2003.

[Cordier et al., 2004] M.-O. Cordier, P. Dague, F. Levy, J. Montmain, M. Staroswiecki, and L. Trave-Massuyes. Conflicts versus analytical redundancy relations: A comparative analysis of the model-based diagnostic approach from the artificial intelligence and automatic control perspectives. IEEE Transactions on Systems, Man and Cybernetics - Part B, 34(5):2163–2177, 2004.

[Frisk et al., 2003] E. Frisk, D. Dustegor, M. Krysander, and V. Cocquempot. Improving fault isolability properties by structural analysis of faulty behavior models: application to the DAMADICS benchmark problem. In Proceedings of IFAC Safeprocess'03, Washington, USA, 2003.

[Gupta et al., 2004] S. Gupta, G. Biswas, and J. Ramirez. An improved algorithm for hybrid diagnosis of complex systems. In Proceedings of DX'04, 2004.

[Jeron et al., 2006] T. Jeron, H. Marchand, and M.-O. Cordier. Motifs de surveillance pour le diagnostic de systemes evenements discrets. In Proceedings of RFIA'2006, 2006.

[Jiang et al., 2001] S. Jiang, Z. Huang, V. Chandra, and R. Kumar. A polynomial time algorithm for diagnosability of discrete event systems. IEEE Transactions on Automatic Control, 46(8):1318–1321, 2001.

[Patton et al., 1989] R.J. Patton, P. Frank, and R. Clark. Fault diagnosis in dynamic systems - Theory and Applications. Prentice Hall International, London UK, 1989.

[Puig et al., 2005] V. Puig, J. Quevedo, T. Escobet, and B. Pulido. On the integration of fault detection and isolation in model based fault diagnosis. In Proceedings of DX'05, pages 227–232, 2005.

[Roze and Cordier, 2002] L. Roze and M.-O. Cordier. Diagnosing discrete-event systems: extending the "diagnoser approach" to deal with telecommunication networks. Journal on Discrete-Event Dynamic Systems: Theory and Applications (JDEDS), 12(1):43–81, 2002.

[Sampath et al., 1995] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis. Diagnosability of discrete-event systems. IEEE Transactions on Automatic Control, 40:1555–1575, 1995.

[Staroswiecki and Comtet-Varga, 1999] M. Staroswiecki and G. Comtet-Varga. Fault detectability and isolability in algebraic dynamic systems. Proceedings of the European Control Conference, 1999.

[Struss and Dressler, 2003] P. Struss and O. Dressler. A toolbox integrating model-based diagnosability analysis and automated generation of diagnostics. In Proceedings of DX'03, 2003.

[Tanaka, 1989] S. Tanaka. Diagnosability of systems and optimal sensor location. In R.J. Patton, P. Frank, and R. Clark, editors, Fault diagnosis in dynamic systems - Theory and Applications, chapter 5, pages 21–44. Prentice Hall International, London UK, 1989.

[Trave-Massuyes et al., 2001] L. Trave-Massuyes, T. Escobet, and Rob Milne. Model-based diagnosability and sensor placement application to a frame 6 gas turbine subsystem. Proceedings of IJCAI'01, pages 551–556, 2001.

[Trave-Massuyes et al., 2004] L. Trave-Massuyes, T. Escobet, and X. Olive. Model-based diagnosability. Internal Report LAAS N04080, Janvier 2004, 12p. To appear in IEEE Transactions on System, Man and Cybernetics, Part A, 2004.

[Williams and Nayak, 1996] B. C. Williams and P. P. Nayak. A model-based approach to reactive self-configuring systems. Proceedings of AAAI-96, Portland, Oregon, pages 971–978, 1996.

[Yan, 2004] Y. Yan. Sensor placement and diagnosability analysis at design stage. MONET Workshop on Model-Based Systems at ECAI'04, August 22-26, Valencia, Spain, 2004.

[Yoo and Lafortune, 2002] T. Yoo and S. Lafortune. Polynomial-time verification of diagnosability of partially-observed discrete-event systems. IEEE Trans. on Automatic Control, 47(9):1491–1495, 2002.


Exploiting independence in a decentralised and incremental approach of diagnosis

Marie-Odile Cordier
Irisa Dream
Rennes – France

Alban Grastien
Irisa Dream
Rennes – France

Abstract

It is now well-known that the size of the model is the bottleneck when using model-based approaches to diagnose complex systems. To answer this problem, decentralized/distributed approaches have been proposed. The global system model is described through its component models as a set of automata and the global diagnosis is computed from the component diagnoses (also called local diagnoses). Another problem, which is far less considered, is the size of the diagnosis itself, which can also be huge, especially when dealing with uncertain observations. This is why we recently proposed to slice the observation flow into temporal windows and to compute the diagnosis in an incremental way from these diagnosis slices.

In this context, we define in this paper two independence properties (transition and state independence) and we show their relevance to get a tractable representation of diagnosis. To illustrate the impact on the diagnosis size, experimental results on a toy example are given.

1 Introduction

In this paper, we are concerned with the diagnosis of discrete event systems [Cassandras and Lafortune, 1999] where the system behaviour is modeled by automata. This domain has been active since the seminal work proposed by [Sampath et al., 1996]. It consists in finding what happened to the system from existing observations as in [Baroni et al., 1999; Cordier and Thiebaux, 1994; Console et al., 2000; Lunze, 1999; Roze and Cordier, 1998; Cordier and Largouet, 2001]. A classical formal way of representing the diagnosis problem is to express it as the synchronised product of the system model automaton and an observation automaton. This formal definition hides the real problem, which is to ensure an efficient computation of the diagnosis when the system is complex and the observations are possibly uncertain.

It is now well-known that the size of the system model is one bottleneck when using model-based approaches to diagnose complex systems. To answer this problem, decentralized/distributed approaches have been proposed [Pencole and Cordier, 2005; Lamperti and Zanella, 2003; Benveniste et al., 2005]. Instead of being explicitly given, the system model is described through its component models in a decentralized way. From these local models, local diagnoses are computed to explain local observations. When a global decision is needed, a global diagnosis is computed by merging local diagnoses in order to take into account the synchronisation events, which express the dependency relation that may exist between the components. This merging step can be costly and merging strategies have been proposed as in [Pencole and Cordier, 2005; Lamperti and Zanella, 2003]. The main result gained from this work is the importance of detecting concurrent subsystems in order to limit both the computation time and the representation size of the diagnosis.

This work was partially made in NICTA, Canberra (Australia)

A problem, which is far less considered, is the size of the observation flow, which directly impacts the size of the diagnosis itself. It can also be a problem, especially when dealing with uncertain observations, as already remarked by [Lamperti and Zanella, 2003]. Moreover, increasing the observation period decreases the chance of finding independent subsystems. This is why we recently proposed to slice the observation flow into temporal windows and to compute the diagnosis in an incremental way from these diagnosis slices [Grastien et al., 2005]. The idea is then to detect independent subsystems on these limited subperiods and to exploit these properties to get an economical representation and computation of diagnosis.

In this context of incremental and decentralised diagnosis, we define in Section 2 two independence properties (transition and state independence) on automata and we show their relevance to get a tractable representation of diagnosis. The first one, transition independence, expresses that two models do not share any synchronisation events. The second one, state independence, expresses that when decomposing a model into two submodels, no constraints on their initial states have been lost. We first examine in Section 3 the purely decentralised case and propose to represent the diagnosis by a set of transition-independent diagnoses. We show in Section 4 the specific problem related to the incremental computation and propose to use an abstract description of trajectories, from which the set of final states and the trajectories of the global diagnosis can be easily retrieved. To illustrate the impact of our proposal on the diagnosis size, experimental results on a toy example are given in Section 5. We conclude by showing that the next step is to automatically find the best slicing points in order to maximally exploit the two independence properties which were defined.

2 Preliminaries and independence properties

We suppose in this paper that the behavioural models are described by automata. We thus begin by giving some definitions concerning automata which are needed in the following sections. Then, we define the independence properties that are central in this paper. Lastly, we recall the diagnosis definitions and state some hypotheses.

2.1 Automata, synchronisation and restriction

Automata are used to describe the behavioural models of the system components. Let us recall the definition and introduce the notations.

Definition 1 (Automaton). An automaton $A$ is a tuple $(Q, E, T, I, F)$ where $Q$ is a (finite) set of states, $E$ is a set of labels, $T$ is a (finite) set of transitions $(q, l, q')$ with $l \subseteq E$, $I \subseteq Q$ is the set of initial states, and $F \subseteq Q$ is the set of final states.

We suppose that $\forall q \in Q$, the transition $(q, \emptyset, q)$ exists. The label $l$ on the transition $t = (q, l, q')$ indicates which events trigger the transition.

A trajectory is a path in the automaton joining an initialstate to a final state.

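Definition 2 (Trajectory). A trajectory on an automaton $A = (Q, E, T, I, F)$ is a double sequence of states and transitions $\mathit{traj} = q_0 \xrightarrow{l_1} \cdots \xrightarrow{l_n} q_n$ such that: $q_0 \in I$, $q_n \in F$, and $\forall i \in \{1, \ldots, n\}$, $(q_{i-1}, l_i, q_i) \in T$.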

The set of trajectories of an automaton $A$ is denoted $\mathrm{Traj}(A)$. In the following, as we are mainly interested in trajectories and the states they pass through, the automata we consider are trim automata [Cassandras and Lafortune, 1999], i.e. automata such that every state belongs to at least one trajectory. The trim operation transforms an automaton into its corresponding trim automaton by removing the states that do not belong to any trajectory. Note that a trim operation does not remove any trajectory. It can, however, shrink the sets of initial and final states.
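The trim operation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Automaton` class, the `reachable` helper and the small `example` automaton are hypothetical names introduced here, not taken from the paper. A state is kept when it is reachable from an initial state and can reach a final state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Automaton:
    # Hypothetical encoding of the tuple (Q, E, T, I, F) of Definition 1.
    states: frozenset
    events: frozenset
    transitions: frozenset  # triples (q, label, q2); label is a frozenset of events
    initial: frozenset
    final: frozenset


def reachable(starts, edges):
    """States reachable from `starts` following `edges` (dict: state -> set of states)."""
    seen, stack = set(starts), list(starts)
    while stack:
        q = stack.pop()
        for nxt in edges.get(q, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def trim(a):
    """Keep only the states lying on some path from an initial to a final state."""
    forward, backward = {}, {}
    for q, _, q2 in a.transitions:
        forward.setdefault(q, set()).add(q2)
        backward.setdefault(q2, set()).add(q)
    keep = reachable(a.initial, forward) & reachable(a.final, backward)
    return Automaton(
        states=frozenset(keep),
        events=a.events,
        transitions=frozenset(t for t in a.transitions if t[0] in keep and t[2] in keep),
        initial=a.initial & keep,
        final=a.final & keep,
    )


def lab(*events):
    return frozenset(events)


# State 4 is a dead end and state 5 is an initial state that never reaches state 3.
example = Automaton(
    states=frozenset({1, 2, 3, 4, 5}),
    events=lab("a", "b", "c"),
    transitions=frozenset({(1, lab("a"), 2), (2, lab("b"), 3),
                           (1, lab("c"), 4), (5, lab("a"), 4)}),
    initial=frozenset({1, 5}),
    final=frozenset({3}),
)
trimmed = trim(example)
```

In this toy case the only trajectory (1, 2, 3) is preserved, while the set of initial states shrinks from {1, 5} to {1}, as the remark above allows.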

Let us consider the trim automaton in Figure 1. The initial states are represented by an arrow with no origin state, and the final states by a double circle. Then, any path in the figure leading from an initial state to a final state is a trajectory.

Figure 1: Example of automaton

Let us consider the synchronisation of two automata $A_1$ and $A_2$. The events which are common to the transition labels of $A_1$ and $A_2$, i.e. $E_1 \cap E_2$, are called synchronisation events. To be synchronisable, two transitions must either be labeled by events which are not synchronisation events, or have the same synchronisation events. The synchronisation operation on two automata builds the trim automaton where all the trajectories of both automata which cannot be synchronised according to the synchronisation events are removed.

Definition 3 (Synchronisation of automata). Given two automata $A_1 = (Q_1, E_1, T_1, I_1, F_1)$ and $A_2 = (Q_2, E_2, T_2, I_2, F_2)$, the synchronised automaton of $A_1$ and $A_2$, denoted $A_1 \otimes A_2$, is the trim automaton $A = (Q_A, E_A, T_A, I_A, F_A) = \mathrm{Trim}(Q', E', T', I', F')$ such that:
$Q' = Q_1 \times Q_2$,
$E' = E_1 \cup E_2$,
$T' = \{((q_1, q_2), l, (q'_1, q'_2)) \mid \exists l_1, l_2 : (q_1, l_1, q'_1) \in T_1 \wedge (q_2, l_2, q'_2) \in T_2 \wedge l_1 \cap (E_1 \cap E_2) = l_2 \cap (E_1 \cap E_2) \wedge l = l_1 \cup l_2\}$,
$I' = I_1 \times I_2$, and
$F' = F_1 \times F_2$.

The set of states $Q_A$ is included in $Q_1 \times Q_2$, as some states (and transitions) can be removed by the trim operation. In the same way, the initial (resp. final) states of $A$, $I_A$ (resp. $F_A$), are included in $I_1 \times I_2$ (resp. $F_1 \times F_2$).

Figure 2 gives an example of synchronisation. The automaton in Figure 1 and the automaton at the top of Figure 2 are synchronised, leading to the automaton at the bottom. The synchronisation events are the events common to both automata.


Figure 2: Example of synchronisation
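Definition 3 can be turned into a short executable sketch. The code below is an illustration, not the authors' implementation: the `Automaton` class, the helper names and the two toy automata `a1` and `a2` are assumptions made for this example. The implicit $(q, \emptyset, q)$ transitions of Definition 1 are added on the fly so that one automaton can stay idle while the other moves on a non-synchronisation event.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Automaton:
    # Hypothetical encoding of (Q, E, T, I, F).
    states: frozenset
    events: frozenset
    transitions: frozenset  # triples (q, label, q2); label is a frozenset of events
    initial: frozenset
    final: frozenset


def _reach(starts, edges):
    seen, stack = set(starts), list(starts)
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def trim(a):
    """Remove the states that are not on a path from an initial to a final state."""
    fwd, bwd = {}, {}
    for q, _, q2 in a.transitions:
        fwd.setdefault(q, set()).add(q2)
        bwd.setdefault(q2, set()).add(q)
    keep = _reach(a.initial, fwd) & _reach(a.final, bwd)
    kept = frozenset(t for t in a.transitions if t[0] in keep and t[2] in keep)
    return Automaton(frozenset(keep), a.events, kept, a.initial & keep, a.final & keep)


def synchronise(a1, a2):
    """Synchronised product: transitions must agree on the shared events."""
    sync = a1.events & a2.events

    def moves(a):
        # Explicit transitions plus the implicit empty-label self-loops.
        return set(a.transitions) | {(q, frozenset(), q) for q in a.states}

    trans = set()
    for (q1, l1, r1), (q2, l2, r2) in product(moves(a1), moves(a2)):
        if l1 & sync == l2 & sync:
            trans.add(((q1, q2), l1 | l2, (r1, r2)))
    return trim(Automaton(
        states=frozenset(product(a1.states, a2.states)),
        events=a1.events | a2.events,
        transitions=frozenset(trans),
        initial=frozenset(product(a1.initial, a2.initial)),
        final=frozenset(product(a1.final, a2.final)),
    ))


fs = frozenset
# Toy automata sharing the event "s"; a2 must do "b" before it can take part in "s".
a1 = Automaton(fs({"0", "1"}), fs({"s", "a"}), fs({("0", fs({"s"}), "1")}),
               fs({"0"}), fs({"1"}))
a2 = Automaton(fs({"x", "y"}), fs({"s", "b"}),
               fs({("x", fs({"b"}), "y"), ("y", fs({"s"}), "x")}),
               fs({"x"}), fs({"x"}))
result = synchronise(a1, a2)
```

The product state ("1", "y") is built but then trimmed, because it cannot reach the only final state ("1", "x"); this is the reason why $Q_A$ is only included in $Q_1 \times Q_2$.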


The restriction operation of an automaton $A$ removes from $A$ all the initial states which are not in the specified set of states. Due to the trim operation, all the states and transitions which are no longer accessible are removed from $A$.

Definition 4 (Restriction). Let $A = (Q, E, T, I, F)$ be an automaton. The restriction of $A$ by the states of $I'$, denoted $A[I']$, is the automaton $A' = (Q_{A'}, E_{A'}, T_{A'}, I_{A'}, F_{A'}) = \mathrm{Trim}(Q, E, T, I \cap I', F)$.

2.2 Transition and State-independency

The transition-independency property states that two (or more) automata do not have any transition labeled with synchronisation events.

Definition 5 (Transition-independency). $A_1 = (Q_1, E_1, T_1, I_1, F_1)$ and $A_2 = (Q_2, E_2, T_2, I_2, F_2)$ are transition-independent (TI) iff every label $l$ on a transition of $T_1$ or $T_2$ is such that $l \cap (E_1 \cap E_2) = \emptyset$.

For two TI automata, the synchronisation operation is equivalent to a shuffle operation.

Property 1: Let $A_1$ and $A_2$ be two transition-independent automata and $A = A_1 \otimes A_2$. The final (resp. initial) states of $A$ correspond exactly to the Cartesian product of the final (resp. initial) states of $A_1$ and $A_2$.


Figure 3: Example of two TI automata

Figure 3 gives an example of two automata $A_1$ and $A_2$ whose alphabets share synchronisation events. Since neither automaton has a transition labeled with a synchronisation event, the automata are transition-independent. The synchronisation $A = A_1 \otimes A_2$ is represented at the bottom of the figure (for simplicity, the labels on the transitions are not represented). We see that the set of initial states $I$ (resp. final states $F$) is the Cartesian product of $I_1$ (resp. $F_1$) and $I_2$ (resp. $F_2$). Figure 1 and Figure 2 give an example of two automata $A_1$ and $A_2$ that are not transition-independent, as they contain transitions labeled with synchronisation events. The set of final states of their synchronisation is only included in the Cartesian product of $F_1$ and $F_2$.
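As a minimal sketch of Definition 5 and of how Property 1 is used, the following Python fragment checks transition-independency and, when it holds, obtains the final states of the synchronisation by a plain Cartesian product. The `Automaton` tuple and the three toy automata are hypothetical; the code relies on Property 1 rather than proving it, and assumes trim inputs.

```python
from itertools import product
from typing import NamedTuple


class Automaton(NamedTuple):
    # Hypothetical, reduced encoding: states are implicit in the transitions.
    events: frozenset
    transitions: frozenset  # triples (q, label, q2); label is a frozenset of events
    initial: frozenset
    final: frozenset


def transition_independent(a1, a2):
    """True iff no transition label of a1 or a2 contains a shared event."""
    sync = a1.events & a2.events
    return all(not (label & sync)
               for a in (a1, a2)
               for _, label, _ in a.transitions)


def product_states(s1, s2):
    """Cartesian product used by Property 1 for TI automata."""
    return set(product(s1, s2))


fs = frozenset
a1 = Automaton(fs({"a", "s"}), fs({("0", fs({"a"}), "1")}), fs({"0"}), fs({"1"}))
a2 = Automaton(fs({"b", "s"}), fs({("x", fs({"b"}), "y")}), fs({"x"}), fs({"x", "y"}))
a3 = Automaton(fs({"s"}), fs({("u", fs({"s"}), "v")}), fs({"u"}), fs({"v"}))

ti_12 = transition_independent(a1, a2)  # "s" is shared but never used
ti_13 = transition_independent(a1, a3)  # a3 has a transition labeled with "s"
finals_12 = product_states(a1.final, a2.final) if ti_12 else None
```

Here `a1` and `a2` share the event "s" in their alphabets, but no transition uses it, so the final states of their synchronisation can be read off without building the product automaton.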

In Section 3, we are interested in representing a system model in a decomposed way by the set of its subsystem models, the main property being that it must be possible to retrieve the system model from the subsystem models by a composition (synchronisation) operation. In the following, we give the definitions of subsystems and the properties the set of subsystem models must satisfy to get a safe representation of the system model.

Definition 6 (System and subsystems). A system $\Gamma$ can be described by its set of components. A subsystem $\gamma$ is a non-empty set of components: $\gamma \subseteq \Gamma$.

A subsystem model describes the subsystem behaviour and is described by an automaton $A_\gamma = (Q_\gamma, E_\gamma, T_\gamma, I_\gamma, F_\gamma)$ where $E_\gamma$ is the set of events that can occur on this subsystem. Some of these events are shared with other subsystems and are synchronisation events between subsystem models.

Let us now see the properties a set of subsystem models has to satisfy to be a good representation of the system model. We first define what we call a decomposition of $A$ in two automata.

Definition 7 (Decomposition of $A$). Two automata $A_1$ and $A_2$ are said to be a decomposition of an automaton $A$ iff $A = (A_1 \otimes A_2)[I]$ where $I$ are the initial states of $A$.

Note that we do not require that we get exactly $A$ by synchronising $A_1$ and $A_2$, but only a super-automaton of $A$ (i.e. an automaton that contains all the trajectories of $A$ and possibly more). In general, we thus have that the initial (resp. final) states of $A$ are included in $I_1 \times I_2$ (resp. $F_1 \times F_2$).

The idea is that, when you describe a system (whose model is $A$) by its subsystems, you have to describe the subsystem behaviours, which is done through the subsystem models (here $A_1$ and $A_2$), and the way the subsystems interact, which is done through the synchronisation events. Moreover, you have to do it in a proper way, given by Definition 7. But a point is still missing, as the constraints existing between the subsystem initial states in order to represent the system initial states can be lost in the decomposition of $A$. This is why, when composing $A_1$ and $A_2$ by a synchronisation operation, we do not always get back exactly $A$, but an automaton including $A$.

The state-independency property is a property of a decomposition $A_1$ and $A_2$ which ensures that we get exactly $A$.

Definition 8 (State-independent decomposition wrt $A$). Two automata $A_1 = (Q_1, E_1, T_1, I_1, F_1)$ and $A_2 = (Q_2, E_2, T_2, I_2, F_2)$ are said to be a state-independent decomposition wrt $A$ (SI$_A$) iff they are a decomposition of $A$ and if $A = A_1 \otimes A_2$.

Note that, if $A_1$ and $A_2$ both have a unique initial state, and if they are a decomposition of $A$, then $A_1$ and $A_2$ are a state-independent decomposition wrt $A$.

Let us suppose two automata $A_1$ and $A_2$ which are a state-independent decomposition wrt $A$ and are also transition-independent. In this case, due to Property 1, the initial and final states of $A$ can be easily computed as the Cartesian product of the initial and final states of $A_1$ and $A_2$. This property means that, when you are mainly interested in these states, you do not have to perform the synchronisation operation on the automata, which is costly in space.

Property 2: Let $A_1$ and $A_2$ be two transition-independent automata forming a state-independent decomposition wrt $A$. The initial (resp. final) states of $A$ are exactly the Cartesian product of the initial (resp. final) states of $A_1$ and $A_2$.

When $A_1$ and $A_2$ are not a state-independent decomposition wrt $A$, the only way not to lose any information is to add the initial states of $A$ as extra information to the decomposed representation of $A$.

2.3 Diagnosis

Let us now recall the definitions used in the domain of discrete-event systems diagnosis, where the model of the system is represented by an automaton. $t_0$ corresponds to the starting time and $t_n$ to the ending time of the diagnosis.

Definition 9 (Model). The model of the system, denoted Mod, is an automaton.

The model of the system describes its behaviour, and the trajectories of Mod represent the evolutions of the system. The set of initial states $I_{\mathrm{Mod}}$ is the set of possible states at $t_0$. We suppose as usual that $F_{\mathrm{Mod}} = Q_{\mathrm{Mod}}$ (all the states of the system may be final). The set of observable events is denoted $E^{\mathrm{Mod}}_{\mathrm{Obs}} \subseteq E_{\mathrm{Mod}}$.

Let us turn to observations represented by an automaton, where the transition labels are observable events of $E^{\mathrm{Mod}}_{\mathrm{Obs}}$.

Definition 10 (Observation automaton). The observation automaton, denoted Obs, is an automaton describing the observations emitted by the system during the period $[t_0, t_n]$.

Even if the observations are usually subject to uncertainties, we consider in the following that they are represented as a unique sequence of observable events. This simplifies the presentation, but it can be extended to the case of uncertain observations, as we did for instance in [Grastien et al., 2005].

The diagnosis, denoted $\Delta$, is a trim automaton describing the possible trajectories on the model of the system compatible with the observations sent by the system during the period $[t_0, t_n]$. The diagnosis is then formally defined as resulting from the synchronisation operation between the system model Mod and the observation automaton Obs.

Definition 11 (Diagnosis). The diagnosis, denoted $\Delta$, is a trim automaton such that $\Delta = \mathrm{Mod} \otimes \mathrm{Obs}$.

3 Improving diagnosis representation in a decentralised approach

Real-world systems can often be seen as a set of (possibly abstract) interconnected components. Each component has a simple behaviour, but the connections between the components can lead to a complex global behaviour. For this reason, the size of a global model of the system is generally intractable and no global model can be effectively built. To answer this problem, decentralised/distributed approaches have been proposed [Lamperti and Zanella, 2003; Pencole and Cordier, 2005; Benveniste et al., 2005]. In this article, we consider the decentralised approach of Pencole and Cordier. This approach is pictured in Figure 4.

The idea is to describe the system behaviour in a decomposed way. The so-called decentralised model is thus $\overline{\mathrm{Mod}} = \{\mathrm{Mod}_{c_1}, \ldots, \mathrm{Mod}_{c_n}\}$ where $\mathrm{Mod}_{c_i}$ is the behavioural model of the component $c_i$. The decentralised model is built to be a decomposition of the global model Mod. The global model can thus be retrieved by $\mathrm{Mod} = (\mathrm{Mod}_{c_1} \otimes \cdots \otimes \mathrm{Mod}_{c_n})[I]$ where $I$ is the set of initial states of Mod. We consider that the global model has a unique initial state (if it does not, an initialization transition can be added to ensure it) and that the component models also have a unique initial state. They are thus a state-independent decomposition wrt Mod and we have $\mathrm{Mod} = \mathrm{Mod}_{c_1} \otimes \cdots \otimes \mathrm{Mod}_{c_n}$.

The observations Obs can generally be decentralised as follows: $\overline{\mathrm{Obs}} = \{\mathrm{Obs}_{c_1}, \ldots, \mathrm{Obs}_{c_n}\}$ such that $\mathrm{Obs}_{c_i}$ contains the observations from the component $c_i$ and such that $\mathrm{Obs} = \mathrm{Obs}_{c_1} \otimes \cdots \otimes \mathrm{Obs}_{c_n}$.

Given the local model $\mathrm{Mod}_{c_i}$ and the local observations $\mathrm{Obs}_{c_i}$, it is possible to compute the local diagnosis $\Delta_{c_i} = \mathrm{Mod}_{c_i} \otimes \mathrm{Obs}_{c_i}$. These diagnoses represent the local behaviours that are consistent with the local observations. It was shown in [Pencole and Cordier, 2005] that the decentralised diagnosis is a decomposition of $\Delta$. As there is a unique initial state, it is also a state-independent decomposition. It is then possible to compute the global diagnosis of the system by merging all the local diagnoses as follows: $\Delta = \Delta_{c_1} \otimes \cdots \otimes \Delta_{c_n}$.

Figure 4: Principle of the decentralised computation of the diagnosis

A first improvement in the diagnosis computation is that, rather than directly merging all the local diagnoses together, it is possible to incrementally compute the global diagnosis by successive synchronisation operations. Let $\gamma_1$ and $\gamma_2$ be two disjoint subsystems (possibly being components) and let $\gamma = \gamma_1 \cup \gamma_2$ be the subsystem that contains exactly $\gamma_1$ and $\gamma_2$. The subsystem diagnosis $\Delta_\gamma$ can be computed by synchronising the two subsystem diagnoses $\Delta_{\gamma_1}$ and $\Delta_{\gamma_2}$: $\Delta_\gamma = \Delta_{\gamma_1} \otimes \Delta_{\gamma_2}$. The diagnosis of the system is $\Delta = \Delta_\Gamma$.

The next point is that, in spite of the constraints generated by the observations, the size of the global diagnosis can still be large. This is mainly due to the fact that merging concurrent diagnoses amounts to computing the shuffle of two automata, which is costly in terms of number of states and transitions (see for instance Figure 3). A second improvement, which avoids these costly shuffles, is to represent the system diagnosis as a set of transition-independent subsystem diagnoses.

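Definition 12 (Decentralised diagnosis). A decentralised diagnosis $\overline{\Delta}$ is a set of subsystem diagnoses $\{\Delta_{\gamma_1}, \ldots, \Delta_{\gamma_k}\}$ such that $\Delta_{\gamma_i}$ is the diagnosis of the subsystem $\gamma_i$, $\{\gamma_1, \ldots, \gamma_k\}$ is a partition of the system $\Gamma$, and $\forall i, j \in \{1, \ldots, k\}$, $i \neq j \Rightarrow \Delta_{\gamma_i}$ and $\Delta_{\gamma_j}$ are transition-independent.

As seen before, a decentralised diagnosis $\overline{\Delta} = \{\Delta_{\gamma_1}, \ldots, \Delta_{\gamma_k}\}$ is a decomposition of the global diagnosis. The global diagnosis can thus be computed, if needed, by synchronising all the subsystem diagnoses, or equivalently by a shuffle operation, as $\Delta = \Delta_{\gamma_1} \otimes \cdots \otimes \Delta_{\gamma_k}$. Its final states can be obtained by a simple Cartesian product on the final states of all the $\Delta_{\gamma_i}$.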

Algorithm 1 shows how to compute the decentralised diagnosis from the local (component) diagnoses. Until all pairs of diagnoses are transition-independent, the algorithm chooses two transition-dependent diagnoses and merges them. Let us remark that the result is not unique and depends on the merging strategy, which is also very important from a computation time point of view. It was proposed in [Pencole et al., 2001a] to use a dynamic strategy, based on first synchronising the subsystem diagnoses which interact the most, in order to remove as many trajectories as possible early on.

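Algorithm 1 Algorithm to compute a decentralised diagnosis

input: local diagnoses $\{\Delta_{c_1}, \ldots, \Delta_{c_n}\}$
$\overline{\Delta} = \{\Delta_{c_1}, \ldots, \Delta_{c_n}\}$
while $\exists \Delta_{\gamma_1}, \Delta_{\gamma_2} \in \overline{\Delta}$ such that $\Delta_{\gamma_1}$ and $\Delta_{\gamma_2}$ are not transition-independent do
    $\overline{\Delta} = \overline{\Delta} \setminus \{\Delta_{\gamma_1}, \Delta_{\gamma_2}\}$
    $\gamma = \gamma_1 \cup \gamma_2$
    $\Delta_\gamma = \Delta_{\gamma_1} \otimes \Delta_{\gamma_2}$
    $\overline{\Delta} = \overline{\Delta} \cup \{\Delta_\gamma\}$
end while
return: $\overline{\Delta}$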
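The control flow of Algorithm 1 can be sketched as follows. This is a simplified illustration, not the authors' Java implementation: the function is generic in the test `are_ti` and in the merge operation `synchronise`, and the toy `Diag` record (with its `used` field, the set of events that actually label transitions) is a hypothetical stand-in for a real diagnosis automaton. The sketch merges the first transition-dependent pair it finds; it does not implement the dynamic strategy of [Pencole et al., 2001a].

```python
from itertools import combinations
from typing import NamedTuple


def decentralised_diagnosis(local_diagnoses, are_ti, synchronise):
    """Merge diagnoses until every remaining pair is transition-independent."""
    diagnoses = list(local_diagnoses)
    while True:
        pair = next(((d1, d2) for d1, d2 in combinations(diagnoses, 2)
                     if not are_ti(d1, d2)), None)
        if pair is None:
            return diagnoses
        d1, d2 = pair
        diagnoses = [d for d in diagnoses if d is not d1 and d is not d2]
        diagnoses.append(synchronise(d1, d2))


class Diag(NamedTuple):
    # Toy stand-in for a diagnosis automaton (assumption of this sketch).
    components: frozenset
    alphabet: frozenset  # events that can occur on the subsystem
    used: frozenset      # events that actually label some transition


def toy_are_ti(d1, d2):
    sync = d1.alphabet & d2.alphabet
    return not (d1.used & sync) and not (d2.used & sync)


def toy_synchronise(d1, d2):
    # Over-approximation: a real synchronisation may remove transitions.
    return Diag(d1.components | d2.components,
                d1.alphabet | d2.alphabet,
                d1.used | d2.used)


fs = frozenset
local = [
    Diag(fs({"c1"}), fs({"a", "s12"}), fs({"a", "s12"})),
    Diag(fs({"c2"}), fs({"b", "s12", "s23"}), fs({"b", "s12"})),
    Diag(fs({"c3"}), fs({"c", "s23"}), fs({"c"})),
]
result = decentralised_diagnosis(local, toy_are_ti, toy_synchronise)
partition = sorted(sorted(d.components) for d in result)
```

In the toy run, components c1 and c2 interact through "s12" and are merged, while c3 never uses "s23" and stays on its own. Each merge reduces the number of diagnoses by one, so the loop terminates.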

4 Improving diagnosis representation in a decentralised and incremental approach

In the previous section, we considered that the diagnosis was computed on a single period. This means that the observation automaton represents the observations from the beginning to the end of the period, and the diagnosis represents the behaviour during the whole period.

We have seen in the previous section that exploiting transition-independence makes it possible to reduce the size of the diagnosis representation. However, when we consider a long period, as may be the case when you have log files to diagnose, it is very seldom that you have independent behaviours, since each component eventually interacts with most of its neighbours. This is why we recently proposed to slice the observations into temporal windows and to incrementally compute the diagnosis for each temporal window [Grastien et al., 2005]. Given these diagnoses on small windows, it can now be expected to have independent behaviours that can be efficiently represented by a decentralised diagnosis.

The problem with the incremental approach is that it becomes difficult to ensure the state-independency property of the decomposition. This property allowed us, due to Property 2 of Section 2, to get the initial and final states of the global diagnosis without computing it explicitly. To keep the benefit of the decentralised representation of the diagnosis, we propose a solution that enables us to get the initial and final states needed for an incremental diagnosis without having to merge diagnoses, even when state-independency is not satisfied.

Let us first present a formalism-free generalization of the incremental computation by automaton slicing. We then explain why we lose the state-independency property and end by proposing a solution to this problem.

4.1 Incremental diagnosis

The incremental diagnosis relies on the notion of temporal windows first introduced in [Pencole et al., 2001b]. For a detailed presentation of the diagnosis by slices, refer to [Grastien et al., 2005]. Let $[t_0, t_n]$ be the diagnosis period and $t_0, \ldots, t_n$ be a sequence of dates. The temporal window $\mathcal{W}_i$ is the period $[t_{i-1}, t_i]$. Let $\mathrm{Obs}^1, \ldots, \mathrm{Obs}^n$ be a slicing of the observations Obs. It is shown in [Grastien et al., 2005] that, given a slicing of the observations $\mathrm{Obs} = (\mathrm{Obs}^1, \ldots, \mathrm{Obs}^n)$, a diagnosis $\Delta$ on the period $[t_0, t_n]$ can be computed as a sequence of diagnoses $(\Delta^1, \ldots, \Delta^n)$ corresponding to the windows $\mathcal{W}_i$. It is also shown that, given this sequence of automata, it is possible, only if needed, to reconstruct the original automaton $\Delta$ by appending the slices. The trajectories can be computed as follows: a trajectory on this sequence of automata is a sequence of trajectories $\mathit{traj}^i = q^i_0 \xrightarrow{l^i_1} \cdots \xrightarrow{l^i_{m_i}} q^i_{m_i} \in \mathrm{Traj}(\Delta^i)$ where $\forall i$, $q^{i+1}_0 = q^i_{m_i}$.

Let us now reduce the problem to two slices and suppose we have computed a diagnosis $\Delta_{[0,i-1]}$ for the period $[t_0, t_{i-1}]$. We do not presume the way this diagnosis is represented and will come back to this point later. We want to compute the diagnosis $\Delta_{[0,i]}$ by taking into account the observations $\mathrm{Obs}^i$ on the next temporal window $\mathcal{W}_i$. Let us first see how the diagnosis $\Delta^i$ can be computed. We can state that $\Delta^1 = \mathrm{Mod} \otimes \mathrm{Obs}^1$, and $\forall i \neq 1$, $\Delta^i = (\mathrm{Mod}^{I} \otimes \mathrm{Obs}^i)[F^{i-1}]$ where $\mathrm{Mod}^{I} = (Q_{\mathrm{Mod}}, E_{\mathrm{Mod}}, T_{\mathrm{Mod}}, Q_{\mathrm{Mod}}, F_{\mathrm{Mod}})$ and $F^{i-1}$ is the set of final states of $\Delta^{i-1}$.

The $i$th diagnosis of the sequence can theoretically be computed by the synchronisation of the model where all states are initial ($\mathrm{Mod}^{I}$) with the observations $\mathrm{Obs}^i$ of the window. It is however important, from a computational point of view, to restrict the set of initial states to the set of final states $F^{i-1}$ of the previous automaton. It is then possible to describe $\Delta_{[0,i]}$ as the sequence $(\Delta_{[0,i-1]}, \Delta^i)$. Note that the set of final states of $\Delta_{[0,i]}$ is exactly the set of final states of $\Delta^i$.


4.2 Loss of the state-independency property

Our goal is to use, for this sequence of diagnoses, a decentralised computation based on a decentralised model, and a decentralised representation similar to the one proposed in Section 3, based on transition-independent diagnoses.

We want to compute $\Delta^i$ in a decentralised way, which means that we build the local diagnoses before merging them (see Algorithm 1). The diagnosis of the component $c$ in the temporal window $\mathcal{W}_i$ is computed as follows: $\Delta^i_c = (\mathrm{Mod}^{I}_c \otimes \mathrm{Obs}^i_c)[F^{i-1} \downarrow c]$ where $F^{i-1} \downarrow c$ is the projection of the final states of $\Delta^{i-1}$ on component $c$.

By Algorithm 1, we get a set of transition-independent subsystem diagnoses. The problem that appears here is that this set is a decomposition of $\Delta^i$, but it cannot be ensured that it is a state-independent decomposition. Contrary to the case of Section 3, it may happen that existing links with the initial states of the other components are lost when projecting $F^{i-1}$ on a component $c$.

This is illustrated by Figure 5. The figure represents the diagnosis of two components. These components can be either in an Ok state ($O$) or a Faulty state ($F$). The figure presents a two-window diagnosis, each window in a box. During the first window, one of the two components failed, but it is not possible to determine which one. The initial states of each component at the beginning of the second window are obtained by projecting the final states of the first window, and they are $O$ and $F$ for one component and $O'$ and $F'$ for the other one. Nothing happened during the second window. The algorithm thus proposes the two local diagnoses (top and bottom in Figure 5, right), but we can see that the links between the initial states were lost during the projection, and we then get a decomposition of the global diagnosis which is not state-independent. We have $F^2 \subset \{O, F\} \times \{O', F'\}$. To get the exact final states $F^2$, the only solution would be to synchronise the local diagnoses and then to use the restriction operation with the final states of the first window as argument, which is not as economical as expected. We propose below a solution to this problem.

Figure 5: Example of loss of information in a naive decentralised representation of the incremental diagnosis
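The loss of information described for Figure 5 can be reproduced with plain sets. The state names `O`, `F`, `O'` and `F'` follow the description in the text, and the variable names are illustrative assumptions. The first window ends with exactly one faulty component; projecting on each component and recombining by Cartesian product introduces two global states that were never possible.

```python
from itertools import product

# Final states of the first window: exactly one of the two components is faulty.
final_window1 = {("O", "F'"), ("F", "O'")}

# Projection of these final states on each component.
proj_1 = {q1 for q1, _ in final_window1}
proj_2 = {q2 for _, q2 in final_window1}

# Naive recombination: the Cartesian product of the projections.
naive = set(product(proj_1, proj_2))

# Global states introduced by the projection (both Ok, or both faulty).
spurious = naive - final_window1
```

The two spurious states correspond to "no component failed" and "both components failed", which the first window had ruled out.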

4.3 TI + abstract representation

The solution we propose is to add an abstract representation of the diagnosis to the set of transition-independent subsystem diagnoses. We first define what an abstraction is, and then show that it allows us to keep the benefit of the decentralised representation even when it is not state-independent wrt the global diagnosis, as shown in Section 4.2.

An abstraction of an automaton only preserves as states the initial and final states of the original automaton, and abstracts each trajectory existing in the original automaton into a transition labeled by $\emptyset$.

Definition 13 (Abstraction). Let $A = (Q, E, T, I, F)$ be a trim automaton. The abstraction of $A$, denoted $\mathrm{Abst}(A)$, is the (trim) automaton $A' = (Q', E', T', I', F')$ where: $Q' = I \cup F$, $E' = \emptyset$, $T' = \{(q, \emptyset, q') \mid \exists \mathit{traj} = q_0 \xrightarrow{l_1} \cdots \xrightarrow{l_n} q_n \in \mathrm{Traj}(A) : q_0 = q \wedge q_n = q'\}$, $I' = I$, and $F' = F$.

The following two properties can be easily proved.

Property 3: Let $A_1$ and $A_2$ be two transition-independent automata. Then, $\mathrm{Abst}(A_1) \otimes \mathrm{Abst}(A_2) = \mathrm{Abst}(A_1 \otimes A_2)$.

Property 4: Let $A_1$ and $A_2$ be two transition-independent automata and let $I$ be a set of states. Then, $(\mathrm{Abst}(A_1) \otimes \mathrm{Abst}(A_2))[I] = \mathrm{Abst}((A_1 \otimes A_2)[I])$.

The main problem with the loss of the state-independency property is that we can no longer get the set of final states by a mere Cartesian product on the final states of the subsystem diagnoses. The abstraction allows us to compute them without having to perform the expensive synchronisation of the subsystem diagnoses. In fact, the final states are directly computed as the Cartesian product of the final states of the abstractions of the subsystem diagnoses, which is a lot less expensive.

Let us consider the $i$th window $\mathcal{W}_i$. We know the set $I^i$ of initial states of the current window, as they are the final states of the preceding one. This set can be in a decentralised form, i.e. described by a set of sets of states $\{I^i_1, \ldots, I^i_k\}$ such that $I^i = I^i_1 \times \cdots \times I^i_k$. As explained in Section 4.2, the subsystem diagnoses are computed using Algorithm 1, which returns a set $\overline{\Delta}^i = \{\Delta^i_{\gamma_1}, \ldots, \Delta^i_{\gamma_k}\}$ of transition-independent diagnoses. We need to get the final states, as they are used to restrict the initial states of the next window, but in the absence of the state-independency property, they can no longer be computed from the final states of $\overline{\Delta}^i$ (which in fact only give a superset of $F^i$).

To build the abstract representation, we propose to use Algorithm 2. To obtain the set of final states, the idea is, instead of synchronising the transition-independent automata, to synchronise their abstractions. Then, a restriction is performed using the initial states $I^i$, to get the exact final states $F^i$.

As, at the end, all the abstract subsystem diagnoses composing $\overline{\Delta}^i_{\mathrm{abs}}$ are state-independent, we know that the set of initial states of $\Delta^i$ is the set of initial states of $\overline{\Delta}^i_{\mathrm{abs}}$. Moreover, we have the following property:

Property 5: The set of final states of $\Delta^i$ is the set of final states of $\overline{\Delta}^i_{\mathrm{abs}}$.


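Algorithm 2 Computation of the abstract representation of the diagnosis of $\mathcal{W}_i$

input: local diagnoses $\overline{\Delta}^i = \{\Delta^i_{\gamma_1}, \ldots, \Delta^i_{\gamma_k}\}$ + the set of initial states $I^i$
$\overline{\Delta}^i_{\mathrm{abs}} = \{\mathrm{Abst}(\Delta^i_\gamma) \mid \Delta^i_\gamma \in \overline{\Delta}^i\}$
while $\exists \Delta^{i,\mathrm{abs}}_{\gamma_1}, \Delta^{i,\mathrm{abs}}_{\gamma_2} \in \overline{\Delta}^i_{\mathrm{abs}}$ such that $\Delta^{i,\mathrm{abs}}_{\gamma_1}$ and $\Delta^{i,\mathrm{abs}}_{\gamma_2}$ are not state-independent wrt $(\Delta^{i,\mathrm{abs}}_{\gamma_1} \otimes \Delta^{i,\mathrm{abs}}_{\gamma_2})[I^i \downarrow \gamma_1 \cup \gamma_2]$ do
    $\overline{\Delta}^i_{\mathrm{abs}} = \overline{\Delta}^i_{\mathrm{abs}} \setminus \{\Delta^{i,\mathrm{abs}}_{\gamma_1}, \Delta^{i,\mathrm{abs}}_{\gamma_2}\}$
    $\gamma = \gamma_1 \cup \gamma_2$
    $\Delta^{i,\mathrm{abs}}_\gamma = (\Delta^{i,\mathrm{abs}}_{\gamma_1} \otimes \Delta^{i,\mathrm{abs}}_{\gamma_2})[I^i \downarrow \gamma_1 \cup \gamma_2]$
    $\overline{\Delta}^i_{\mathrm{abs}} = \overline{\Delta}^i_{\mathrm{abs}} \cup \{\Delta^{i,\mathrm{abs}}_\gamma\}$
end while
return: $\overline{\Delta}^i_{\mathrm{abs}}$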

It is then possible to get the set of final states of $\Delta^i$ without synchronising the transition-independent subsystem diagnoses. The decentralised representation of the diagnosis on a temporal window is thus the set of its transition-independent subsystem diagnoses and the set of its transition- and state-independent abstract diagnoses.
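To make the idea concrete, here is a simplified sketch of how exact final states can be obtained from abstractions alone. It is an assumption-laden illustration rather than Algorithm 2 itself: each abstraction is reduced to its set of (initial state, final state) pairs, all abstractions are combined at once instead of merging only the pairs that are not state-independent, and the names `final_states`, `abst_1`, `abst_2` and `abst_2b` are hypothetical. Since all abstraction labels are empty, synchronising abstractions pairs their transitions freely, and the restriction keeps only the pairs starting from a known global initial state.

```python
from itertools import product


def final_states(initial_global, abstractions):
    """Final states of the restricted synchronisation of the abstractions.

    initial_global: set of tuples, one local initial state per subsystem.
    abstractions: list (same order) of sets of (initial, final) pairs.
    """
    finals = set()
    for start in initial_global:
        # Final states each subsystem can reach from its part of `start`.
        options = [{qf for q0, qf in abst if q0 == local}
                   for local, abst in zip(start, abstractions)]
        finals |= set(product(*options))
    return finals


# Second window of the Figure 5 scenario: nothing happens on either component.
initial_w2 = {("O", "F'"), ("F", "O'")}
abst_1 = {("O", "O"), ("F", "F")}
abst_2 = {("O'", "O'"), ("F'", "F'")}
exact = final_states(initial_w2, [abst_1, abst_2])

# Variant: the second component may also fail during the window.
abst_2b = {("O'", "O'"), ("O'", "F'"), ("F'", "F'")}
exact_b = final_states(initial_w2, [abst_1, abst_2b])
```

In the first case the two global states of the previous window are recovered exactly, whereas a Cartesian product of the local final states would give four. The cost depends only on the number of initial and final states, not on the size of the subsystem diagnoses.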

5 Experiments

In this section, we present an experiment on diagnosis using the decentralised and incremental approach. We present the system to diagnose and then give the results.

5.1 System

The system we want to diagnose is a network of Q interconnected components as presented in Figure 6.


Figure 6: Topology of the network

Each component has the same behaviour: when a fault occurs on a component, it reboots and forces its neighbours to reboot too. When asked to reboot, the component sends the observation $\mathit{IReboot}_i$ (where $i$ is the number of the component), and when the reboot is finished, it sends the observation $\mathit{IAmBack}_i$. When a component is asked to reboot, it can be asked to reboot by another component (and then send the $\mathit{IReboot}_i$ observation) at the beginning of the rebooting process.

The model is presented in Figure 7. The reboot! message indicates that reboot is sent to all the neighbours, and the reboot? message indicates that a neighbour sent the reboot message to the component. So, for example, on a component with three neighbours, there are three transitions between the same two states, each labeled by the reboot? message coming from one of the neighbours together with the component's own IReboot observation.

Figure 7: Model of a component (labels appearing in the figure: fault, reboot!, IReboot, end, IAmBack, reboot?, rebooting)

Let us remark that the decentralised modeling contains exactly Q states, while the global model would contain nearly K u [ states.

5.2 Results

The algorithms were programmed in Java, and run on a Linux machine with a 1.73 GHz Intel processor. We deal with 45 observations. The results of the experiments are given in Table 1.

The first experiment was made with a unique temporal window as presented in Section 3. The computation took more than 26 mn and produced automata, one of which contains [ states and [[ U] transitions. It can be noted that taking into account the transition-independence property of diagnoses in the decentralised representation is worthwhile, as four independent subsystems are identified. It avoids computing the shuffle for these subsystem diagnoses, which is certainly a very good point. However, due to the length of the window, one of the automata is still very large.

Using the method described in Section 4, the observations are now sliced into temporal windows. The diagnosis was computed in less than Q second, producing U small automata. The number of states is , that is [ of the number of states used in the previous automaton, and the number of transitions is U which represents less than Q of the transitions of the previous automaton. It confirms that slicing observations is beneficial in that it increases the number of independent subsystems, and thus of independent diagnoses.

no slicing Q st slicing nd s. U rd s.nb states

[ UoU [nb trans [n[ U] U [U Q [ Q nb auto U [ Q ]time ] mn [[ s Q s Q s U mn [ s

Table 1: Results of the experimentations

Let us now stress the importance of the slicing for the good results of the method. In a third experiment, the first temporal window of the previous experiment was sliced into two. It can be noted that the number of states of the diagnosis increased by about and the number of transitions by U . Moreover, the computation time increased to 10 seconds. The reason is that you sometimes need to have enough observations on a subsystem to conclude that this subsystem did not communicate with another subsystem.

In a fourth experiment, two temporal windows of the first window are merged into one unique window. The corresponding computation time is then nearly minutes and the number of states and transitions exploded. It confirms that the slicing operation is a critical operation and that deciding what is the best slicing is an appealing perspective.

6 Conclusion

In this paper, we consider the diagnosis of discrete-event systems modeled by automata. To avoid the state-explosion problem that appears when dealing with large systems, we use a decentralised computation of the diagnosis. This approach consists in dividing the system into transition-independent subsystems. We show that the global diagnosis can be safely represented by the set of diagnoses of these transition-independent subsystems. An important point is that the transitions can be easily computed from this decentralised representation by relying on the state-independency property which we define. It is then clear that the smaller the transition-independent subsystems are, the better the diagnosis computation is, in terms of both time and space efficiency.

When the period of observation is long, very seldom do you have independent subsystems, since each component eventually interacts with most of its neighbours. We thus propose to slice the diagnosis period into temporal windows, in order to get, on these windows, transition-independent subsystems. The problem that appears is that the state-independency property does not hold anymore. We are then no longer able to get the exact final states. On the one hand, such a set of diagnoses for transition-independent but not state-independent subsystems gives us only a superset of the global diagnosis, which is not satisfactory. On the other hand, computing the set of transition-independent and state-independent subsystem diagnoses would be too expensive. We thus propose to keep the decentralised diagnosis representation (a set of transition-independent subsystem diagnoses), and to add an abstract representation of both state- and transition-independent diagnoses, enabling us to compute the final states in an economical and efficient way. We show that we get a safe representation of the global diagnosis.

Some points need to be analysed in more detail. As can be seen in Algorithm 2, it is necessary to have an efficient way to check whether or not two abstract diagnoses are state-independent, and we are currently working on this point. Another concern is the slicing. As shown in Section 5, a bad slicing can lead to very little benefit. An interesting prospect would be to automatically find the best slicing to obtain a diagnosis represented as efficiently as possible.

In this article, we considered that the observations were certain and ordered. In real-world systems, this hypothesis generally does not hold, and we proposed to represent the observations by an automaton [Grastien et al., 2005]. The results of this article can be extended to cope with that. A more difficult case to consider is when you have to slice the observations on-line, while not all the observations have been received yet. Finally, since we deal with state spaces that differ from one window to the next, it would be interesting to use these results for reconfigurable systems, whose topology (the set of components and the connections between them) can evolve over time, as considered for instance in [Grastien et al., 2004].

References

[Baroni et al., 1999] P. Baroni, G. Lamperti, P. Pogliano, and M. Zanella. Diagnosis of large active systems. Artificial Intelligence, 110:135–183, 1999.

[Benveniste et al., 2005] A. Benveniste, S. Haar, E. Fabre, and Cl. Jard. Distributed monitoring of concurrent and asynchronous systems. Discrete Event Dynamic Systems, 15(1):33–84, 2005.

[Cassandras and Lafortune, 1999] C. Cassandras and S. Lafortune. Introduction to Discrete Event Systems. Kluwer Academic Publishers, 1999.

[Console et al., 2000] L. Console, C. Picardi, and M. Ribaudo. Diagnosis and diagnosability using PEPA. In ECAI’2000, pages 131–135, 2000.

[Cordier and Largouet, 2001] M.-O. Cordier and Ch. Largouet. Using model-checking techniques for diagnosing discrete-event systems. In Twelfth International Workshop on Principles of Diagnosis (DX-01), pages 39–46, 2001.

[Cordier and Thiebaux, 1994] M.-O. Cordier and S. Thiebaux. Event-based diagnosis for evolutive systems. In DX’1994, pages 64–69, 1994.

[Grastien et al., 2004] A. Grastien, M.-O. Cordier, and Ch. Largouet. Extending decentralized discrete-event modelling to diagnose reconfigurable systems. In Fifteenth International Workshop on Principles of Diagnosis (DX-04), pages 75–80, 2004.

[Grastien et al., 2005] A. Grastien, M.-O. Cordier, and Ch. Largouet. Incremental diagnosis of discrete-event systems. In Sixteenth International Workshop on Principles of Diagnosis (DX-05), pages 119–124, 2005.

[Lamperti and Zanella, 2003] G. Lamperti and M. Zanella. Diagnosis of Active Systems. Kluwer Academic Publishers, 2003.

[Lunze, 1999] J. Lunze. Discrete-event modeling and diagnosis of quantized dynamical systems. In 10th International Workshop on Principles of Diagnosis (DX-99), pages 147–154, 1999.

[Pencole and Cordier, 2005] Y. Pencole and M.-O. Cordier. A formal framework for the decentralised diagnosis of large scale discrete event systems and its application to telecommunication networks. Artificial Intelligence Journal, 164(1-2):121–170, 2005.

[Pencole et al., 2001a] Y. Pencole, M.-O. Cordier, and L. Roze. A decentralized model-based diagnostic tool for complex systems. In The Thirteenth IEEE international conference on tools with artificial intelligence (ICTAI’01), pages 95–102, 2001.

[Pencole et al., 2001b] Y. Pencole, M.-O. Cordier, and L. Roze. Incremental decentralized diagnosis approach for the supervision of a telecommunication network. In Twelfth International Workshop on Principles of Diagnosis (DX-01), pages 151–158, 2001.

[Roze and Cordier, 1998] L. Roze and M.-O. Cordier. Diagnosing discrete-event systems: an experiment in telecommunication networks. In Fourth International Workshop on Discrete Event Systems (WODES’98), 1998.

[Sampath et al., 1996] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis. Failure diagnosis using discrete-event models. Control Systems Technology, 4(2):105–124, 1996.


Multiple Fault Diagnosis in Complex Physical Systems

Matthew Daigle, Xenofon Koutsoukos, and Gautam Biswas

Institute for Software Integrated Systems (ISIS)

Department of Electrical Engineering and Computer Science

Vanderbilt University

Nashville, TN 37235

matthew.j.daigle,xenofon.koutsoukos,[email protected]

Abstract

Multiple fault diagnosis is a challenging problem because the number of candidates grows exponentially in the number of faults. In addition, multiple faults in dynamic systems may be hard to detect, because they can mask or compensate each other’s effects. The multiple fault problem is important, since the single fault assumption can lead to incorrect or failed diagnoses when multiple faults occur. We present an approach to simultaneous and cascaded multiple fault diagnosis in dynamical systems. Our approach is based on the TRANSCEND fault isolation scheme, where fault effects are represented as qualitative fault signatures. A notion of multiple fault diagnosability is introduced with respect to most likely minimal candidates. The online fault isolation algorithm explores the candidate space in increasing candidate size to generate minimal candidates. A mobile robot example demonstrates the approach.

1 Introduction

Fault detection and isolation (FDI) is a key component of any safety-critical system. When faults and degradations occur, it is important to quickly identify the fault that occurred so corrective actions can be taken in a timely manner and catastrophic situations can be avoided. In general, a number of different failures can happen in complex systems, and the likelihood of multiple faults occurring increases in harsh operating environments. FDI schemes that do not take into account multiple faults run the risk of generating incorrect diagnoses or even failing to find a diagnosis after faults occur.

Our approach focuses on multiple fault diagnosis in complex physical systems. It is based on the TRANSCEND framework [Mosterman and Biswas, 1999; Manders et al., 2000], which employs a qualitative approach for analysis of fault transient behavior. The diagnosis model is used to generate fault signatures, which represent magnitude and higher order effects of faults on the measurements.

Multiple fault diagnosis is a difficult problem in dynamical systems because interactions among fault effects can obscure the fault signatures. In this paper, we provide a systematic scheme for generation of multiple fault signatures from the single fault signatures. We analyze the multiple fault signatures to define the notion of n-diagnosability, which defines diagnosability with respect to most likely minimal fault sets, where n is the maximum allowed fault multiplicity. We then present an extension to the online fault isolation algorithm of TRANSCEND such that it finds the most likely minimal fault set that is consistent with the observed measurement deviations. If a system is n-diagnosable for some n, the algorithm will isolate a unique multiple fault candidate if n or fewer faults occur.

Previous work in multiple fault diagnosis has concentrated mostly on static systems. The approach in [de Kleer and Williams, 1987] is based on conflict recognition and candidate generation. The system, GDE, utilizes the notion of minimal candidates, and chooses the next best measurements to make based on a priori fault probabilities. In our approach, measurements must be selected at design time, and they are used to generate and refine fault hypotheses when deviations from nominal behavior are observed. The GDE approach parallels the consistency-based diagnosis approach of [Reiter, 1987], an extension of which is presented in [Ng, 1990] to handle diagnosis of devices whose behavior changes over time. The changes are modeled by a set of qualitative simulation states. A similar approach that handles behavioral modes is presented in [Subramanian and Mooney, 1996]. In contrast, our approach applies to continuous-time models and can handle both additive and multiplicative faults. A control theory-based approach based on residual structures is described in [Gertler, 1998]. A residual structure is derived to meet the desired isolation properties. Our approach to multiple fault representation is somewhat analogous, although our residuals map to a richer feature set.

The paper is organized as follows. Section 2 describes the TRANSCEND approach to qualitative fault isolation and presents the example model. Section 3 formulates the representation of multiple faults and a notion of multiple fault diagnosability based on the representation. Section 4 extends the fault isolation algorithm of TRANSCEND to account for multiple faults. Section 5 demonstrates our approach to multiple fault diagnosis. Section 6 concludes the paper.

2 Background

TRANSCEND [Mosterman and Biswas, 1999] is a well-developed methodology for diagnosis of abrupt faults in complex physical systems with continuous dynamics. It employs a qualitative model-based approach for fault isolation. System models are constructed using bond graphs [Karnopp et al., 2000]. Faults are modeled as abrupt and persistent changes in parameter values of components in the bond graph model of the system.

Fault isolation in TRANSCEND is based on a qualitative analysis of the transient dynamics caused by abrupt faults. Deviations in measurement values after a fault occurrence constitute a fault signature, where predicted deviations in magnitude and higher order derivative values are mapped to the symbols +, 0, and -, which correspond to a deviation above normal, no deviation, and a deviation below normal, respectively.

Fault isolation in TRANSCEND utilizes a Temporal Causal Graph (TCG) representation, which can be derived directly from the bond graph model of the system. The TCG captures the causal and temporal relations between system variables. It specifies the signal flow graph of the system in a form where edges are labeled with single component parameter values or direct or inverse proportionality relations.

Fault signatures are generated using a forward-propagation algorithm on the TCG to predict qualitative effects of faults on measurements. The qualitative effect of a fault, + or -, is propagated to all measurement vertices in the TCG to determine fault signatures for each measurement. We denote the set of all faults as $F = \{f_1, f_2, \ldots, f_\kappa\}$ and the set of all measurements as $M = \{m_1, m_2, \ldots, m_\lambda\}$. For $f \in F$ and $m \in M$, $\sigma_{f,m}$ is the fault signature for measurement $m$ given that fault $f$ has occurred. Two faults $f_i, f_j \in F$ are distinguishable using fault signatures if $(\exists m \in M)\ \sigma_{f_i,m} \neq \sigma_{f_j,m}$.

Relative measurement orderings [Daigle et al., 2005] are an extension to the original TRANSCEND algorithm. The extended algorithm uses predicted temporal orders of measurement deviations to discriminate between faults. This is extended here to multiple fault diagnosis. Like fault signatures, measurement orderings are derived systematically from the TCG. They are based on common subpaths in the model. A measurement ordering is denoted as $m_1 \prec_f m_2$, meaning that if fault $f$ occurs, measurement $m_1$ will deviate before measurement $m_2$. We denote the set of such orderings as $\Omega_{f_i}$ for fault $f_i \in F$. Two faults are distinguishable using orderings if their ordering sets are in temporal conflict.

Definition 1 (Temporal Conflict). $\Omega_{f_i}$ is in temporal conflict with $\Omega_{f_j}$ if $(\exists m_i, m_j \in M)\ m_i \prec_{f_i} m_j \wedge m_j \prec_{f_j} m_i$.

Fault isolation starts with a backward propagation of an observed symbolic deviation to identify initial fault candidates. Once candidate hypotheses are identified, a forward propagation algorithm generates the fault signatures and measurement orderings, i.e., the effects of each hypothesized fault on measurements. Then observed deviations are compared to predictions using a progressive monitoring scheme to discriminate between the fault hypotheses.

Throughout the paper we focus on a mobile robot as an example system. Details of the system model and TCG for this system are described in [Daigle et al., 2006] and summarized very briefly here. The bond graph is shown in Figure 1. The robot model consists of inertia, capacitor, and resistor elements modeling masses and inertias, mechanical stiffness, and energy dissipation in the system, respectively. The 1-junctions represent the common velocity points, and the 0-junctions common force points. The TCG is given in Figure 2. State variables are circled and measured variables boxed. Edges with a dt specifier imply an integration effect. All other edges are instantaneous.

Figure 1: Mobile robot bond graph

Figure 2: Mobile robot TCG

| Fault | $v_L$ | $v_R$ | $\theta$ | Measurement Orderings |
|---|---|---|---|---|
| $A_L^-$ | 0- | 0* | 0+ | $v_L \prec_{A_L^-} v_R$, $v_L \prec_{A_L^-} \theta$ |
| $A_R^-$ | 0* | 0- | 0- | $v_R \prec_{A_R^-} v_L$, $v_R \prec_{A_R^-} \theta$ |
| $E_L^-$ | -+ | 0* | 0- | $v_L \prec_{E_L^-} v_R$, $v_L \prec_{E_L^-} \theta$ |
| $E_R^-$ | 0* | -+ | 0+ | $v_R \prec_{E_R^-} v_L$, $v_R \prec_{E_R^-} \theta$ |
| $G^+$ | 0+ | 0- | +- | $\theta \prec_{G^+} v_L$, $\theta \prec_{G^+} v_R$ |
| $G^-$ | 0- | 0+ | -+ | $\theta \prec_{G^-} v_L$, $\theta \prec_{G^-} v_R$ |

Table 1: Fault signatures for a robot system

Table 1 shows fault signatures for actuator (left: $A_L^-$, right: $A_R^-$), encoder (left: $E_L^-$, right: $E_R^-$), and gyroscope (positive bias: $G^+$, negative bias: $G^-$) faults in the mobile robot system. The measurements include the velocity of the left wheel, $v_L$, the velocity of the right wheel, $v_R$, and the heading, $\theta$. The first symbol indicates a predicted magnitude change (discontinuity) and the second symbol indicates the first nonzero slope symbol in this measurement. A * indicates an indeterminate effect. It is indistinguishable from a + or - because it could manifest as either effect. For example, from the TCG we cannot determine whether $A_L^-$ causes a 0+ or a 0- effect on $v_R$. Relative measurement orderings are also listed in the table.

3 Multiple Fault Diagnosability

Single faults are isolated by comparing predicted to actual measurement deviations. The predictions depend on which measurements are selected in the system, because different measurements provide different discriminatory information. If the prediction models (fault signatures and measurement orderings) of two faults differ, we say that these two faults are distinguishable.

Definition 2 (Single Fault Distinguishability). Two faults $f_i, f_j \in F$ are distinguishable if $(\exists m \in M)\ \sigma_{f_i,m} \neq \sigma_{f_j,m}$ or $(\exists m_i, m_j \in M)\ m_i \prec_{f_i} m_j \wedge m_j \prec_{f_j} m_i$.

Definition 3 (Single Fault Diagnosability). A system is single fault diagnosable if $(\forall f_i, f_j \in F,\ f_i \neq f_j)$ $f_i$ and $f_j$ are distinguishable.
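Definitions 1 to 3 can be checked mechanically on the data of Table 1. The following is a minimal sketch rather than the TRANSCEND implementation: the string labels for faults and measurements, the helper names, and the treatment of * as "either + or -" when comparing signatures are assumptions made for illustration.

```python
from itertools import combinations

# Table 1 signatures (magnitude symbol, slope symbol); '*' is indeterminate.
SIGNATURES = {
    "AL-": {"vL": "0-", "vR": "0*", "theta": "0+"},
    "AR-": {"vL": "0*", "vR": "0-", "theta": "0-"},
    "EL-": {"vL": "-+", "vR": "0*", "theta": "0-"},
    "ER-": {"vL": "0*", "vR": "-+", "theta": "0+"},
    "G+":  {"vL": "0+", "vR": "0-", "theta": "+-"},
    "G-":  {"vL": "0-", "vR": "0+", "theta": "-+"},
}

# Table 1 orderings: (a, b) means a deviates before b under that fault.
ORDERINGS = {
    "AL-": {("vL", "vR"), ("vL", "theta")},
    "AR-": {("vR", "vL"), ("vR", "theta")},
    "EL-": {("vL", "vR"), ("vL", "theta")},
    "ER-": {("vR", "vL"), ("vR", "theta")},
    "G+":  {("theta", "vL"), ("theta", "vR")},
    "G-":  {("theta", "vL"), ("theta", "vR")},
}


def expand(sig):
    """Concrete symbols a predicted signature may manifest as (assumption: '*' is '+' or '-')."""
    return {sig.replace("*", s) for s in "+-"}


def in_temporal_conflict(omega_i, omega_j):
    """Definition 1: some pair of measurements is ordered oppositely by the two faults."""
    return any((b, a) in omega_j for (a, b) in omega_i)


def distinguishable(fi, fj):
    """Definition 2: signatures differ on some measurement, or orderings conflict."""
    by_signature = any(
        expand(SIGNATURES[fi][m]).isdisjoint(expand(SIGNATURES[fj][m]))
        for m in SIGNATURES[fi]
    )
    return by_signature or in_temporal_conflict(ORDERINGS[fi], ORDERINGS[fj])


def single_fault_diagnosable(faults):
    """Definition 3: every pair of distinct faults is distinguishable."""
    return all(distinguishable(fi, fj) for fi, fj in combinations(faults, 2))
```

Calling `single_fault_diagnosable(list(SIGNATURES))` evaluates the definition over all fifteen fault pairs of the robot example.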

For single faults, the isolation procedure compares the observed measurement deviations over time to those predicted by the fault signatures and measurement orderings. If the system is diagnosable, then there exists a unique fault which is consistent with these deviations.

We expand our fault isolation procedure to deal with multiple fault candidates.

Definition 4 (Candidate). A candidate is a set of faults $c \subseteq F$ that is consistent with the observations. The set of all candidates is denoted as $C = \mathcal{P}(F)$ and the set of all candidates of size $\leq n$ as $C(n)$.

Figure 3: Effect of fault occurrence times on symbol generation of residual r(t)

Multiple fault diagnosis algorithms are more complex than single fault diagnosis algorithms for two reasons. First, the effects of a fault could be masked or compensated by the effects of another fault. For example, $A_L^-$ may occur, causing deviations of 0- on $v_L$, 0- on $v_R$, and 0+ on $\theta$. Clearly, these observations are consistent with only $A_L^-$ occurring. However, if $A_R^-$ also occurred, but with a smaller magnitude so that the effects of $A_L^-$ dominate, the fault sets $\{A_L^-\}$ and $\{A_L^-, A_R^-\}$ cannot be distinguished. So, we seek to define diagnosability with respect to most likely minimal candidates.

The second complication in multiple fault diagnosis is that the same multiple fault can manifest in different ways. For example, $A_L^-$ with $E_L^-$ could either produce a 0- effect or a -+ effect on $v_L$, depending on which fault occurs first, and on the fault propagation delays in the system. If $E_L^-$ occurs first, we will see -+ because discontinuities are observed at the point of fault occurrence. However, if $A_L^-$ occurs first, we may see either 0- or -+ depending on how soon $E_L^-$ occurs after $A_L^-$. Figure 3 illustrates this point. If $E_L^-$ occurs close enough to $A_L^-$, the deviation caused by $A_L^-$ may not be detected. The symbol generation on the measurement residual could compute either effect. The second change is also not helpful because it could be caused either by a new fault or by the dynamics of the original fault.

3.1 Representing Multiple Faults

Taking into account these issues, we represent the effects of multiple faults on a single measurement as the union of predicted single fault effects. For example, the fault set $\{A_L^-, E_L^-\}$ could manifest either 0- or -+ on $v_L$, 0- or 0+ on $v_R$, and 0- or 0+ on $\theta$.

A multiple fault signature for a set of faults $F' \subseteq F$, denoted by $\sigma_{F',m}$, is an element of the set of possible fault signatures for the faults in $F'$, i.e., $\Sigma_{F',m} = \{\sigma_{f,m} \mid f \in F'\}$. We define a complete fault signature as follows.

Definition 5 (Complete Fault Signature). A complete fault signature for fault $f \in F$, denoted $\sigma_f$, is a tuple $(\sigma_{f,m_1}, \sigma_{f,m_2}, \ldots, \sigma_{f,m_\lambda})$ consisting of the signatures for $f$ on each measurement. A complete multiple fault signature for fault set $F' \subseteq F$ is an element of the set of complete fault signatures $\Sigma_{F'}$, where an element is denoted as $\sigma_{F'} = (\sigma_{F',m_1}, \sigma_{F',m_2}, \ldots, \sigma_{F',m_\lambda})$, such that $(\forall \sigma_{F'} \in \Sigma_{F'})(\forall \sigma_{F',m_i} \in \sigma_{F'})\ \sigma_{F',m_i} \in \Sigma_{F',m_i}$.

Informally, a complete multiple fault signature for $F'$ is a complete signature which can be constructed by choosing and combining signatures for single measurements from faults in the fault set $F'$. As an example, Table 2 shows $\Sigma_{\{A_L^-, E_L^-\}}$.

| | $v_L$ | $v_R$ | $\theta$ | Realizable? |
|---|---|---|---|---|
| 1 | 0- | 0- | 0- | no |
| 2 | 0- | 0- | 0+ | yes |
| 3 | 0- | 0+ | 0- | no |
| 4 | 0- | 0+ | 0+ | yes |
| 5 | -+ | 0- | 0- | yes |
| 6 | -+ | 0- | 0+ | no |
| 7 | -+ | 0+ | 0- | yes |
| 8 | -+ | 0+ | 0+ | no |

Table 2: The complete signatures of $\Sigma_{\{A_L^-, E_L^-\}}$ and their physical realizability
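The construction of the signature columns of Table 2 can be sketched as a Cartesian product of per-measurement signature sets. This is an illustrative sketch only: the labels, the helper names, and the expansion of * into the two concrete symbols + and - (which is how the text treats $v_R$ for this fault set) are assumptions.

```python
from itertools import product

MEASUREMENTS = ("vL", "vR", "theta")

# Single fault signatures of the two faults, copied from Table 1.
SIGNATURES = {
    "AL-": {"vL": "0-", "vR": "0*", "theta": "0+"},
    "EL-": {"vL": "-+", "vR": "0*", "theta": "0-"},
}


def expand(sig):
    """Concrete symbols a predicted signature may manifest as (assumption: '*' is '+' or '-')."""
    return {sig.replace("*", s) for s in "+-"}


def measurement_signatures(fault_set, m):
    """Sigma_{F',m}: union of the single fault effects on measurement m."""
    out = set()
    for f in fault_set:
        out |= expand(SIGNATURES[f][m])
    return out


def complete_signatures(fault_set):
    """Sigma_{F'}: every way of picking one possible symbol per measurement."""
    per_measurement = [sorted(measurement_signatures(fault_set, m)) for m in MEASUREMENTS]
    return set(product(*per_measurement))
```

For `{"AL-", "EL-"}` the product contains the eight rows of Table 2, and the signatures of `{"EL-"}` alone form a subset of it, in line with the inclusion of subset signatures discussed in the text.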

A complete multiple fault signature can be created by choosing single signatures from 1 to $|F'|$ faults, where $|F'|$ is the size of the fault set $F'$. As a result, a complete multiple fault signature set will include all the complete signatures of the individual faults it contains. Therefore, fault effects due to fault masking and compensation are included. In general, for $F'' \subseteq F'$, we have $\Sigma_{F''} \subseteq \Sigma_{F'}$. This is evidenced in Table 2, e.g., $\{A_L^-, E_L^-\}$ can produce (-+,0+,0-), and according to Table 1, so can $E_L^-$ by itself. The double fault $\{A_L^-, E_L^-\}$ may occur, but the observed deviations may be consistent with $A_L^-$ or $E_L^-$ occurring by themselves.

3.2 Physically Realizable Fault Signatures

Not all signatures in $\Sigma_{F'}$ may physically manifest in the system behavior; which ones can is determined by the fault propagation times inherent in the system. The set $\Sigma_{F'}$ can be constrained by using temporal information in the system model. The resulting set is called the set of physically realizable fault signatures.

Definition 6 (Physical Realizability). The set of physically realizable complete fault signatures for a fault set $F'$, denoted $\Sigma^R_{F'}$, is the set of complete multiple fault signatures for $F'$ that are consistent with the TCG model of system behavior.

Whether some $\sigma_{F'} \in \Sigma_{F'}$ belongs in $\Sigma^R_{F'}$ can be determined using relative measurement orderings. Consider $E_L^-$ and $G^+$. Both faults produce discontinuities (-+ or +-) on some measurement. Because discontinuities manifest at the point of fault occurrence, it is not possible for both faults to occur without a discontinuity being observed. We must observe -+ on $v_L$, +- on $\theta$, or both. Therefore, (0+,0-,0-), for example, should not be in $\Sigma^R_{\{E_L^-, G^+\}}$.

This notion can be formalized with relative measurement orderings. Essentially, single fault orderings should be obeyed with respect to single fault signatures. If some fault $f_i$ produces a deviation on a measurement, $m_i$, before another measurement, $m_j$, and another fault $f_j$ produces a deviation on $m_j$ before $m_i$, then if both faults occur, we cannot observe $f_i$’s effect on $m_j$ together with $f_j$’s effect on $m_i$ as the first effects on $m_i$ and $m_j$.¹ To see $f_i$’s effect on $m_j$, we would have had to observe its effect on $m_i$ first. Similarly, to see $f_j$’s effect on $m_i$, we would have had to observe its effect on $m_j$ first.

¹ We are only interested in the first observed measurement deviation since that is what the symbol generator provides.

Figure 4: Realizability constraint representations. (a) Constraint 1. (b) Constraint 2.

For simplicity, we express this constraint in terms of two faults and two measurements. An automata representation is given in Figure 4(a). The top automaton represents the ordering $m_1 \prec_{f_1} m_2$ and the bottom $m_2 \prec_{f_2} m_1$. If $f_1$ affects $m_1$ first (event $\sigma_{f_1,m_1}$) and $f_2$ affects $m_2$ first (event $\sigma_{f_2,m_2}$), then we cannot observe both $f_1$’s effect on $m_2$ and $f_2$’s effect on $m_1$ as the first deviations on $m_1$ and $m_2$. If these are the only two measurements, then if $f_1$ and $f_2$ occur together, we must observe $f_1$’s effect on $m_1$ or $f_2$’s effect on $m_2$ as the first deviation on the respective measurements. This property is expressed by the synchronous composition of the two automata, and stated formally as the following lemma.

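Lemma 1 (Realizability Constraint 1). For two faults $f_i, f_j \in F$ and two measurements $m_i, m_j \in M$, if $m_i \prec_{f_i} m_j$ and $m_j \prec_{f_j} m_i$, then $(\forall \sigma_{\{f_i,f_j\}} \in \Sigma_{\{f_i,f_j\}})$, $\sigma_{\{f_i,f_j\}} \notin \Sigma^R_{\{f_i,f_j\}}$ if $\sigma_{\{f_i,f_j\},m_i} = \sigma_{f_j,m_i} \neq \sigma_{f_i,m_i}$ and $\sigma_{\{f_i,f_j\},m_j} = \sigma_{f_i,m_j} \neq \sigma_{f_j,m_j}$.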

A related constraint evolves from this information. Consider again the fault set $\{A_L^-, E_L^-\}$. Orderings predict that both faults manifest in $v_L$ first. Therefore, if $v_L$ deviates as 0-, then $A_L^-$ will propagate to the rest of the measurements before $E_L^-$ does, so we will not see any effects inconsistent with $A_L^-$, e.g., we will not see 0- on $\theta$. This is because $E_L^-$ cannot propagate from $v_L$ to $\theta$ any faster than $A_L^-$ can.

The physical reasoning behind this constraint is that the ordering $m_i \prec_{f_i} m_j$ implies that the fastest way to reach $m_j$ is through $m_i$ given $f_i$ has occurred. So if some other fault $f_j$ reaches $m_i$ first, it will traverse this same path to $m_j$, and cause $m_j$ to deviate from its effect propagating on this path (or from some faster path from $f_j$ to $m_j$). Therefore when $f_i$ finally reaches $m_i$, it cannot propagate to $m_j$ any faster than $f_j$ had, so we cannot observe its effect on $m_j$.

For simplicity, we express this constraint also in terms of two faults and two measurements. An automata representation is given in Figure 4(b). The top automaton represents the ordering $m_1 \prec_{f_1} m_2$ and the bottom represents the constraint that we will only observe the effect on a measurement from one fault. If $f_2$ affects $m_1$ first, then we cannot observe $f_1$’s effect on $m_2$. This property is expressed by the synchronous composition of the two automata, and stated formally as the following lemma.


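Lemma 2 (Realizability Constraint 2). For two faults $f_i, f_j \in F$ and two measurements $m_i, m_j \in M$, if $m_i \prec_{f_i} m_j$, then $(\forall \sigma_{\{f_i,f_j\}} \in \Sigma_{\{f_i,f_j\}})$, $\sigma_{\{f_i,f_j\}} \notin \Sigma^R_{\{f_i,f_j\}}$ if $\sigma_{\{f_i,f_j\},m_i} = \sigma_{f_j,m_i} \neq \sigma_{f_i,m_i}$ and $\sigma_{\{f_i,f_j\},m_j} = \sigma_{f_i,m_j} \neq \sigma_{f_j,m_j}$.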
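As a rough check, Lemma 2 can be applied to the eight signatures of Table 2. The sketch below is an illustration under assumptions, not the authors' procedure: the labels and helper names are hypothetical, an observed symbol is taken to equal a predicted one when it matches after reading * as + or -, and only the second constraint is applied (the first needs opposite orderings, which $A_L^-$ and $E_L^-$ do not have).

```python
MEASUREMENTS = ("vL", "vR", "theta")

# Rows of Table 2: row number -> observed symbols on (vL, vR, theta).
TABLE2 = {
    1: ("0-", "0-", "0-"), 2: ("0-", "0-", "0+"),
    3: ("0-", "0+", "0-"), 4: ("0-", "0+", "0+"),
    5: ("-+", "0-", "0-"), 6: ("-+", "0-", "0+"),
    7: ("-+", "0+", "0-"), 8: ("-+", "0+", "0+"),
}

# Single fault predictions and orderings, copied from Table 1.
SIGNATURES = {
    "AL-": {"vL": "0-", "vR": "0*", "theta": "0+"},
    "EL-": {"vL": "-+", "vR": "0*", "theta": "0-"},
}
ORDERINGS = {
    "AL-": {("vL", "vR"), ("vL", "theta")},
    "EL-": {("vL", "vR"), ("vL", "theta")},
}


def matches(observed, predicted):
    """Observed symbol equals the prediction (assumption: '*' stands for '+' or '-')."""
    return observed in {predicted.replace("*", s) for s in "+-"}


def violates_constraint_2(sig, fi, fj):
    """Lemma 2 for the pair (fi, fj): fj's effect seen first on mi, yet fi's effect seen on mj."""
    for mi, mj in ORDERINGS[fi]:
        first_is_fj = matches(sig[mi], SIGNATURES[fj][mi]) and not matches(sig[mi], SIGNATURES[fi][mi])
        later_is_fi = matches(sig[mj], SIGNATURES[fi][mj]) and not matches(sig[mj], SIGNATURES[fj][mj])
        if first_is_fj and later_is_fi:
            return True
    return False


def realizable(row, fi="AL-", fj="EL-"):
    """A row is kept only if neither role assignment of the two faults violates Lemma 2."""
    sig = dict(zip(MEASUREMENTS, row))
    return not (violates_constraint_2(sig, fi, fj) or violates_constraint_2(sig, fj, fi))


REALIZABLE_ROWS = sorted(k for k, row in TABLE2.items() if realizable(row))
```

Under these assumptions `REALIZABLE_ROWS` holds the rows marked "yes" in Table 2.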

Table 2 lists the set of physically realizable signatures based on these constraints for $\{A_L^-, E_L^-\}$. Signatures 1, 3, 6, and 8 are not realizable due to the second constraint.

An additional constraint that we impose is to only allow certain combinations of faults, as this will also limit the number of complete multiple fault signatures. It does not make sense to allow fault sets consisting of multiple changes of the same parameter because we assume fault effects are persistent. Therefore, examples such as $\{G^+, G^-\}$ are not valid candidates.

We also employ practical knowledge about systems to limit the size of allowable fault candidate sets. The assumption is that candidates with a large number of faults are highly unlikely; therefore, we assume that the maximum candidate size is $\leq n$. The set of all fault signatures for fault sets of size $\leq n$ is denoted as $\Sigma(n) = \{\sigma_{F'} \in \Sigma_{F'} \mid F' \subseteq F, |F'| \leq n\}$. The set of all physically realizable fault signatures for fault sets of size $\leq n$ is denoted as $\Sigma^R(n) = \{\sigma_{F'} \in \Sigma^R_{F'} \mid F' \subseteq F, |F'| \leq n\}$.

The realizability constraints can be extended to multiple faults and measurements. A general way to describe the constraints is by using the automata representation. For a given fault set, we can describe its possible set of event trajectories (and thus physically realizable fault signatures) by taking the synchronous product of all the single fault orderings and the two-state automata that represent a measurement being affected by only one fault. To compute $\Sigma^R(n)$ from this, we need only restrict the trajectories to those including events from at most $n$ faults.

We can also define the measurement orderings that can be created by multiple faults as $\Omega_{F_i,F_j} = \Omega_{F_i} \cap \Omega_{F_j}$, for $F_i, F_j \subseteq F$. That is, only shared measurement orderings will be consistent with both faults occurring in any order. This can be seen in the automata representation of the orderings.

3.3 n-diagnosability

Based on the set of physically realizable multiple fault signatures and relative measurement orderings for multiple faults, we can define the notion of distinguishability between candidates for multiple faults.

Definition 7 (Multiple Fault Distinguishability). Two fault sets $F_i$ and $F_j$ are distinguishable if $\Sigma^R_{F_i} \cap \Sigma^R_{F_j} = \emptyset$ or $\Omega_{F_i}$ is in temporal conflict with $\Omega_{F_j}$.

Informally, two fault sets are distinguishable if it is not possible for them to manifest in the system measurements in the same way. We do not, however, define multiple fault diagnosability using this definition. We described previously how, due to fault masking and compensation, a fault set and a superset may manifest in the same way. If so, then for $F' \subseteq F''$, $\Sigma^R_{F'} \subseteq \Sigma^R_{F''}$, and $\Omega_{F'} \subseteq \Omega_{F''}$. We, therefore, consider diagnosability only with respect to minimal candidates.

Definition 8 (Minimal Candidate). A candidate $c$ is minimal if there does not exist a candidate $c'$ such that $c' \subset c$.

In addition to using minimal candidates, we also consider the likelihood of fault occurrence. The assumption is that all faults are equally likely, so candidates of smaller size are more likely than those of larger size. Therefore, the ultimate goal of the fault isolation procedure is to isolate the minimal candidate of smallest size. In general, $\{f_1, f_2\}$ and $\{f_3\}$ may both be minimal candidates, because neither is a subset of the other. We consider $\{f_3\}$ to be the simpler explanation because it is of smaller size. Therefore, the fault isolation procedure does not have to consider less likely candidates when more likely candidates exist.

The main reason for operating with most likely candidates is that fault masking and compensation may prevent us from isolating the true set of faults that has occurred. We do not wish to classify a system as undiagnosable because we cannot distinguish between a candidate and its superset. Like other work, we assume the principle of parsimony [Reiter, 1987] and consider a diagnosis as the simplest explanation given the observed measurement deviations. The assumption is further supported, in general, by the fact that the probability of failure occurrence decreases significantly as fault size increases. A diagnosis only represents a best effort result. A diagnosis of $\{f_1, f_2\}$, for example, means that at least $f_1$ and $f_2$ must have occurred, but does not mean that some other fault $f_3$ has not also happened; rather, it only implies that $f_3$ could not have occurred by itself.

Definition 9 (Fault Isolation Procedure). Given a candidate size limit $n > 0$ and the set of measurement orderings, the fault isolation procedure is a function $I : \Sigma^R(n) \to \mathcal{P}(C(n))$.

Fault isolation operates in a progressive fashion as new measurements deviate. Because only physically realizable fault signatures for candidates of size $\leq n$ are given as input, this function will always return a nonempty set of candidates. Multiple fault diagnosability is defined in terms of the fault isolation procedure and the given candidate size limit.

Definition 10 (n-diagnosability). Given a candidate size limit $n$, a system is n-diagnosable if after all measurements have deviated, $(\forall \sigma_{F'} \in \Sigma^R(n))\ |I(\sigma_{F'})| = 1$.

Informally, a system is n-diagnosable if given any physically realizable multiple fault signature for candidates of size $\leq n$, a single minimal candidate of smallest size $\leq n$ is isolated. We next describe our fault isolation procedure based on this notion of multiple fault diagnosability.

4 Diagnosing Multiple Faults

We follow the conflict-based approach of [de Kleer and Williams, 1987], where a conflict is defined as a set of assumptions which cannot all be true, and thus support a symptom (e.g., $\overline{a_1 \wedge a_2 \wedge a_3}$). In TRANSCEND, the TCG is used to create a direct mapping from faults to symptoms, i.e., fault signatures and measurement orderings. Instead of using conflicts, we refer to a hypothesis set, which represents all possible faults which can explain a particular symptom.

Definition 11 (Hypothesis Set). A hypothesis set is a set of faults, at least one of which must have occurred given a particular set of measurement deviations that have occurred.


A hypothesis set is equivalent to a conflict, in that it represents a set of negated assumptions (an assumption being that a certain parameter is not faulty), at least one of which must be true (e.g., a conflict $\overline{a_1 \wedge a_2 \wedge a_3} \equiv \bar{a}_1 \vee \bar{a}_2 \vee \bar{a}_3 \equiv f_1 \vee f_2 \vee f_3$, a hypothesis set).

Hypothesis sets can be generated directly from the fault signature matrix and measurement orderings. Given a measurement deviation, we construct the hypothesis set to be the set of faults consistent with the deviation. For example, a 0- for $v_L$, using only fault signatures, produces the hypothesis set $\{A_L^-, A_R^-, E_R^-, G^-\}$. Any of these faults occurring, or combinations of them, support the symptom.
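The signature-only construction of a hypothesis set can be sketched as a lookup in Table 1. This is a minimal illustration: the labels and function names are hypothetical, * is read as "either + or -", and measurement orderings (which the full procedure also uses to prune the set) are deliberately ignored.

```python
# Table 1 signatures (magnitude symbol, slope symbol); '*' is indeterminate.
SIGNATURES = {
    "AL-": {"vL": "0-", "vR": "0*", "theta": "0+"},
    "AR-": {"vL": "0*", "vR": "0-", "theta": "0-"},
    "EL-": {"vL": "-+", "vR": "0*", "theta": "0-"},
    "ER-": {"vL": "0*", "vR": "-+", "theta": "0+"},
    "G+":  {"vL": "0+", "vR": "0-", "theta": "+-"},
    "G-":  {"vL": "0-", "vR": "0+", "theta": "-+"},
}


def expand(sig):
    """Concrete symbols a predicted signature may manifest as (assumption: '*' is '+' or '-')."""
    return {sig.replace("*", s) for s in "+-"}


def hypothesis_set(measurement, observed):
    """Faults whose predicted signature on `measurement` is consistent with `observed`."""
    return {
        fault
        for fault, sig in SIGNATURES.items()
        if observed in expand(sig[measurement])
    }
```

With this reading, `hypothesis_set("vL", "0-")` reproduces the four-fault set given in the example above.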

Candidate generation proceeds similarly to [de Kleer and Williams, 1987]. As new measurements deviate, new hypothesis sets are generated. These hypothesis sets restrict the possible candidate space and result in a new set of minimal candidates. Given a new hypothesis set, new candidates are formed by adding a single fault from the new hypothesis set. Since a hypothesis set is a set of faults consistent with an observation, these new candidates will also be consistent with the new observation as well as all old observations covered by the base candidate.

Because n-diagnosability only requires isolating a unique candidate of the smallest size, we introduce a candidate size limit into our procedure. As long as we have a candidate at our current size level, we do not explore candidates of larger size. Further, we only perform this analysis if we eliminate all candidates at the current level.

To illustrate the general approach, consider the fault set $\{A_L^-, A_R^-, E_L^-, E_R^-\}$. The candidate space, which can be represented as a lattice of $C$, is shown in Figure 5. The candidate size limit is given as $n = 2$, and the starting size level is $l = 1$. The first measurement deviation, -+ for $v_L$, generates the hypothesis set $\{E_L^-\}$, because only that fault can produce that deviation on $v_L$ given $v_R$ and $\theta$ have not yet deviated. We now know that this fault must have occurred. At a later time point, we are given the deviation 0- for $v_R$. This generates the hypothesis set $\{A_R^-, E_L^-\}$, because only these faults can cause $v_R$ to deviate that way given $\theta$ has not yet deviated. $A_L^-$ is not included in this hypothesis set because it did not cause $v_L$ to deviate, so we cannot see its effect on $v_R$ (this relates to the second realizability constraint). At this point, we still have a candidate of size 1, so we do not yet consider any of size 2. If we were to consider the complete fault set, then a deviation of +- for $\theta$ would rule out the possibility that $E_L^-$ occurred by itself, and we would then consider candidates of size 2. If the system is 2-diagnosable, a unique candidate of size 2 will be identified.

The pseudocode for the online diagnosis algorithm is shown as Algorithm 1. It works as follows. As new measurements deviate, hypothesis sets are formed and the candidate set refined by eliminating inconsistent candidates. This follows the TRANSCEND approach. Eliminated candidates are saved for later analysis. If a single unique candidate is found during this procedure, the candidate is returned as the most likely minimal candidate, barring any future measurement deviations.

When candidates at the candidate size level $l$ are all eliminated, the discarded minimal candidates are used to produce new minimal candidates of size $l + 1$ using the hypothesis sets gathered. This procedure is given as Function 2. For each eliminated candidate, new candidates of size $l + 1$ are formed using the hypothesis set which caused it to be eliminated. Since the hypothesis set caused the elimination, the hypothesis set and the eliminated candidate have no common fault, so a candidate of size $l$ cannot be constructed. Since new candidates are formed by adding exactly one fault from the hypothesis set, only candidates of size $l + 1$ are formed.

Figure 5: Candidate lattice for fault set $\{A_L^-, A_R^-, E_L^-, E_R^-\}$

Algorithm 1 Fault Isolation
Input: maximum candidate size n
Variables: current candidates list, hypothesis sets list, eliminated candidates list
When a new measurement deviates:
  Form the conflict and record it
  Eliminate inconsistent candidates
  if no candidates are left then
    Expand eliminated candidates to the next size
  end if
  if one candidate is left then
    Return the candidate
  end if

Each new candidate formed is then checked for consistency with hypothesis sets that were recorded after its base candidate was eliminated. If the new candidate is consistent with all of these, it is added to the current candidate list. If not, it is added to the eliminated candidates list, because applying a new hypothesis set would form a candidate of size l + 2, which we are not considering at that time. If no new candidates are found, then the level is increased and the process repeated. If the size limit is reached, then an unmodeled fault or a fault combination of size > n has occurred.
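The isolation and expansion steps described above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the class name `FaultIsolator`, the method names, and the string fault labels are hypothetical, and the hypothesis sets are assumed to be supplied by an external qualitative model (their generation is not shown). A candidate is taken to be consistent with a hypothesis set when the two share at least one fault.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

Candidate = FrozenSet[str]


class FaultIsolator:
    """Sketch of Algorithm 1 (isolation) and Function 2 (expansion)."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size              # maximum candidate size n
        self.level = 1                        # current candidate size l
        self.hypothesis_sets: List[Set[str]] = []
        self.candidates: Set[Candidate] = set()
        # eliminated candidate -> index of the hypothesis set that removed it
        self.eliminated: Dict[Candidate, int] = {}

    def observe(self, hypothesis_set: Iterable[str]) -> Set[Candidate]:
        """Record the hypothesis set of a new deviation and refine candidates."""
        conflict = set(hypothesis_set)
        self.hypothesis_sets.append(conflict)
        index = len(self.hypothesis_sets) - 1
        if index == 0:
            # The first deviation seeds the single-fault candidates.
            self.candidates = {frozenset([fault]) for fault in conflict}
        for candidate in list(self.candidates):
            if not candidate & conflict:      # no common fault: inconsistent
                self.candidates.remove(candidate)
                self.eliminated[candidate] = index
        while not self.candidates:
            self._expand()
        return set(self.candidates)

    def _expand(self) -> None:
        """Build candidates of size l + 1 from eliminated ones of size l."""
        self.level += 1
        if self.level > self.max_size:
            raise RuntimeError("unmodeled fault or fault combination of size > n")
        previous = [(c, k) for c, k in self.eliminated.items()
                    if len(c) == self.level - 1]
        for base, k in previous:
            for fault in self.hypothesis_sets[k]:
                new = base | {fault}
                # Only hypothesis sets recorded after the elimination need checking.
                later = self.hypothesis_sets[k + 1:]
                failing = next((k + 1 + j for j, h in enumerate(later)
                                if not new & h), None)
                if failing is None:
                    self.candidates.add(new)
                else:
                    self.eliminated.setdefault(new, failing)
```

Feeding the three hypothesis sets of the first execution example of Section 5, in order, leaves the single double-fault candidate formed from the eliminated single fault and the last hypothesis set.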

Theorem 1. Algorithm 1 will return a unique most likely minimal candidate if the system is n-diagnosable and a fault combination of size l ≤ n occurs.

Proof. The algorithm never eliminates consistent candidates. The algorithm also only considers larger candidates when no smaller candidate can explain the observations. Therefore, the algorithm will find the smallest set of candidates at any level. If the system is n-diagnosable, then a unique candidate will exist of size ≤ n. If so, at the lowest possible level the algorithm will find a unique candidate.

Function 2 Expand Candidates

Input: maximum candidate size n
if candidate size limit is exceeded then
    Return failure
end if
for all eliminated candidates of the previous size do
    Construct new candidates using the conflict that caused its elimination
end for
Eliminate candidates inconsistent with the recorded conflicts
if no candidates are left then
    Expand eliminated candidates to the next size
else
    Return candidates
end if

If n is fixed, the computational complexity of the algorithm is polynomial in the number of single faults, because $O(|F|^n)$ multiple faults are considered. If n is left unspecified, we are limited to a fault multiplicity of $|F|$. In this case the algorithm is exponential in the number of single faults.
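As a rough illustration of this bound, the number of fault combinations of size at most n is a sum of binomial coefficients. The helper below is a hypothetical counting aid (the name `candidate_count` is not from the paper); it only counts subsets and says nothing about how many of them the algorithm actually visits.

```python
from math import comb


def candidate_count(num_faults: int, max_size: int) -> int:
    """Number of non-empty fault subsets with at most max_size elements."""
    return sum(comb(num_faults, k) for k in range(1, max_size + 1))
```

For a fixed `max_size` the count grows polynomially with `num_faults`; when `max_size` equals `num_faults` it equals 2^|F| − 1, the exponential case.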

In the single fault algorithm, as soon as a single fault is isolated, it is declared as the true fault, and future measurement deviations can be ignored. In the case of multiple faults, a single isolated fault does not necessarily indicate the true fault. It only indicates the current simplest diagnosis, given the deviations observed thus far. So, future measurement deviations may result in a better understanding of what faults actually occurred in the system. If there is a unique candidate at any point, the algorithm will return it. Because more measurement deviations can only expand this candidate, the current unique candidate is partially correct. Future deviations may or may not provide a more exact diagnosis.

5 Mobile Robot Example

In this section, we go through a detailed example execution of Algorithm 1. First, however, we must analyze the diagnosability of the system to ensure we will get unique results. We let n = 2 for our analysis.

Table 3 lists some of the physically realizable fault signatures for the robot system. There are several points to make here. First, the signature (0+,0-,0+) is absent. This is because it violates the realizability constraints. There are several double faults which contain this signature in their signature set. However, this signature is not physically realizable for any of them. Take, for example, $\{A_L^-, A_R^-\}$. Only $A_R^-$ can produce 0+ on $v_L$. Because $A_L^-$ causes $v_L$ to deviate first, this means that $A_R^-$ will affect $\theta$ first; however, only $A_L^-$ can produce 0+ on $\theta$. Thus, this signature violates the second realizability constraint for this double fault.

We also see from Table 3 that the system is not 2-diagnosable. If $\theta$ deviates first, observing either (0-,0-,+-) or (0-,0-,-+) cannot be explained by a single fault, but two double faults are consistent with each. For example, consider observing (0-,0-,+-) with $\theta$ deviating first. If then both wheels start slowing down, this cannot be explained by $G^+$ by itself. However, given that both velocities are below nominal, we cannot determine which actuator fault caused it, because only $\theta$ allows us to discriminate between them in this case. Orderings do not help either, because even if we see $v_L$ or $v_R$ deviate next, we do not know if that deviation was due to $G^+$ propagating or an actuator fault appearing. Although we cannot distinguish which actuator fault occurred with $G^+$, we still know that $G^+$ must have occurred, and that some actuator fault has also occurred. This can sometimes be helpful.

ΣR(2)         Smallest minimal candidates
(0-,0-,0-)    $\{A_R^-\}$ ($v_R$ first) or $\{A_L^-, A_R^-\}$ ($v_L$ first)
(0-,0-,0+)    $\{A_L^-\}$ ($v_L$ first) or $\{A_L^-, A_R^-\}$ ($v_R$ first)
(0-,0-,+-)    $\{A_L^-, G^+\}$, $\{A_R^-, G^+\}$ ($\theta$ first) or $\{A_L^-, G^+\}$ ($v_L$ first) or $\{A_R^-, G^+\}$ ($v_R$ first)
(0-,0-,-+)    $\{A_L^-, G^-\}$, $\{A_R^-, G^-\}$ ($\theta$ first) or $\{A_L^-, G^-\}$ ($v_L$ first) or $\{A_R^-, G^-\}$ ($v_R$ first)
(0+,0-,0-)    $\{A_R^-\}$ ($v_R$ first)
(0+,0-,+-)    $\{G^+\}$ ($\theta$ first) or $\{A_R^-, G^+\}$ ($v_R$ first)
...           ...
(-+,-+,0-)    $\{E_L^-, E_R^-\}$ ($v_L$ or $v_R$ first)
(-+,-+,0+)    $\{E_L^-, E_R^-\}$ ($v_L$ or $v_R$ first)

Table 3: 2-Diagnosability analysis for the mobile robot

We now consider a double fault which is distinguishable, and demonstrate the execution of the algorithm. Table 4 illustrates the approach for $\{E_L^-, G^+\}$ occurring. First, $v_L$ deviates with a -+. Only an encoder fault of the left wheel can produce such a deviation on $v_L$ given that no other measurements have deviated, thus the hypothesis set is $\{E_L^-\}$, which becomes our first candidate. Next, $v_R$ deviates with a 0-. Given that $\theta$ has not yet deviated, the hypothesis set becomes $\{A_R^-, E_L^-\}$. $G^+$ is not included in this hypothesis set because we would have seen $\theta$ deviate if it had occurred (constraint 1), and neither is $A_L^-$, because to observe its effect on $v_R$ would mean we would have seen its effect on $v_L$ (constraint 2). Since $\{E_L^-\}$ is consistent with this hypothesis set, it remains a candidate. Next, $\theta$ deviates with a +-. The hypothesis set is $\{G^+\}$, since only $G^+$ can cause $\theta$ to deviate in that way. Since $\{E_L^-\}$ is not consistent with this hypothesis set, it is eliminated. We now have to expand our eliminated candidates to explain the observations. Since the hypothesis set $\{G^+\}$ eliminated $\{E_L^-\}$, we form the new candidate $\{E_L^-, G^+\}$. Since all measurements have deviated, we can be sure that this is our smallest minimal candidate. Since $\{E_L^-, G^+\}$ is distinguishable from all other double faults, the algorithm gives a unique result.

Observation       Hypothesis set          Candidates             Eliminated
1. $v_L$ -+       $\{E_L^-\}$             $\{E_L^-\}$            ∅
2. $v_R$ 0-       $\{A_R^-, E_L^-\}$      $\{E_L^-\}$            ∅
3. $\theta$ +-    $\{G^+\}$               ∅                      $\{E_L^-\}$
Apply (3)                                 $\{E_L^-, G^+\}$       ∅

Table 4: Algorithm execution example 1

We next consider a case where the signature, although realizable for a single fault, can only be explained by a double fault. The signature (0-,0-,0-) is realizable for $A_R^-$; however, if $v_R$ does not deviate first, it cannot be only $A_R^-$ which has occurred. This signature is nevertheless realizable for $\{A_L^-, A_R^-\}$, and we show how the algorithm derives this result.

Observation       Hypothesis set          Candidates             Eliminated
1. $v_L$ 0-       $\{A_L^-\}$             $\{A_L^-\}$            ∅
2. $v_R$ 0-       $\{A_L^-, A_R^-\}$      $\{A_L^-\}$            ∅
3. $\theta$ 0-    $\{A_R^-\}$             ∅                      $\{A_L^-\}$
Apply (3)                                 $\{A_L^-, A_R^-\}$     ∅

Table 5: Algorithm execution example 2

Table 5 summarizes the algorithm execution for this case. First, we see $v_L$ deviate with 0-. Only $A_L^-$ is consistent with $v_L$ deviating first with this effect, thus the hypothesis set is $\{A_L^-\}$. Next, we observe $v_R$ deviate with 0-. Given that $\theta$ has not yet deviated, $\{A_L^-, A_R^-\}$ is the hypothesis set for the new observation. $E_L^-$ is not included because to observe its effect on $v_R$ would mean we would have seen its effect on $v_L$ (constraint 2). Next, we see $\theta$ deviate with 0-. Only $A_R^-$ can cause this (and not $E_L^-$, for the previous reason). Therefore $\{A_L^-\}$ is eliminated, and we expand the candidate into $\{A_L^-, A_R^-\}$. Again, we have a unique result.

6 Conclusions

Multiple fault diagnosis in dynamical systems is complex due to fault masking, compensation, and the many ways multiple faults can manifest. We have presented here an approach to qualitative isolation of multiple faults as an extension of the TRANSCEND approach. We described a notion of multiple fault diagnosability defined over smallest minimal candidates, and presented an algorithm to isolate multiple faults based on this notion. We then discussed the 2-diagnosability analysis of a mobile robot system, and illustrated the algorithm on distinguishable double faults.

Future work will address the scalability of the approach to larger systems and explore conditions which satisfy n-diagnosability for a specific n. The notion of dealing with only the smallest l value and moving to the next l value may also be relaxed by taking into account a priori fault probabilities for the different component parameters, for which more efficient candidate generation strategies will be explored, such as conflict-directed A* [Williams and Ragno, to appear]. Exploring fault identification and fault-adaptive control in the presence of multiple faults is also an open area of research.

Acknowledgment

This work was supported in part by NSF CNS-0452067 and NSF CNS-0347440.

References

[Daigle et al., 2005] M. Daigle, X. Koutsoukos, and G. Biswas. Relative measurement orderings in diagnosis of distributed physical systems. In 43rd Annual Allerton Conference on Communication, Control, and Computing, pages 1707–1716, September 2005.

[Daigle et al., 2006] M. Daigle, X. Koutsoukos, and G. Biswas. Distributed diagnosis of coupled mobile robots. In Proceedings 2006 IEEE International Conference on Robotics and Automation, pages 3787–3794, May 2006.

[de Kleer and Williams, 1987] J. de Kleer and B. C. Williams. Diagnosing multiple faults. Artificial Intelligence, 32:97–130, 1987.

[Gertler, 1998] J. Gertler. Fault Detection and Diagnosis in Engineering Systems. Marcel Dekker, New York, 1998.

[Karnopp et al., 2000] D. C. Karnopp, D. L. Margolis, and R. C. Rosenberg. Systems Dynamics: Modeling and Simulation of Mechatronic Systems. John Wiley & Sons, Inc., New York, 3rd edition, 2000.

[Manders et al., 2000] E.-J. Manders, S. Narasimhan, G. Biswas, and P. J. Mosterman. A combined qualitative/quantitative approach for fault isolation in continuous dynamic systems. In SafeProcess 2000, volume 1, pages 1074–1079, Budapest, Hungary, June 2000.

[Mosterman and Biswas, 1999] P. J. Mosterman and G. Biswas. Diagnosis of continuous valued systems in transient operating regions. IEEE Transactions on Systems, Man and Cybernetics, Part A, 29(6):554–565, 1999.

[Ng, 1990] H. T. Ng. Model-based, multiple fault diagnosis of time-varying, continuous physical devices. In Sixth Conference on Artificial Intelligence Applications, volume 1, pages 9–15, May 1990.

[Reiter, 1987] R. Reiter. A theory of diagnosis from first principles. In Matthew L. Ginsberg, editor, Readings in Nonmonotonic Reasoning, pages 352–371. Morgan Kaufmann, Los Altos, California, 1987.

[Subramanian and Mooney, 1996] S. Subramanian and R. J. Mooney. Qualitative multiple-fault diagnosis of continuous dynamic systems using behavioral modes. In The 1996 13th National Conference on Artificial Intelligence, pages 965–970, August 1996.

[Williams and Ragno, to appear] B. C. Williams and R. Ragno. Conflict-directed A* and its role in model-based embedded systems. Special Issue on Theory and Applications of Satisfiability Testing, Journal of Discrete Applied Math, to appear.


Improvement of Chronicle-based Monitoring using Temporal Focalization and Hierarchization.

Christophe Dousson and Pierre Le Maigat
France Telecom R&D, 2 avenue Pierre Marzin, 22307 Lannion cedex, France.
christophe.dousson,[email protected]

Abstract

This article addresses the symbolic monitoring of real-time complex systems such as telecommunications networks or video interpretation systems. Among the various techniques used for on-line monitoring, we are interested here in temporal scenario recognition. In order to reduce the complexity of the recognition and, consequently, to improve its performance, we explore two methods: the first one is the focalization on particular events (in practice, uncommon ones) and the second one is the factorization of common temporal scenarios in order to do a hierarchical recognition. In this article, we present both concepts and merge them to do a focalized hierarchical recognition. This approach merges and generalizes the two main approaches in symbolic recognition of temporal scenarios: the Store Totally Recognized Scenarios (STRS) approach and the Store Partially Recognized Scenarios (SPRS) approach.

1 Introduction

Symbolic scenario recognition arises in the monitoring of dynamic systems in many areas such as telecommunications network supervision, gas turbine control, healthcare monitoring or automatic video interpretation (for an overview, refer to [Cordier and Dousson, 2000]).

Such scenarios could be obtained, among other things, from experts, by automatic learning [Fessant et al., 2004; Vautier et al., 2005] or by deriving a behavioral model of the system [Guerraz and Dousson, 2004]. Due to the symbolic nature of those scenarios, the engine performing the recognition is rarely directly connected to sensors. Often there is (at least) one dedicated calculus module which transforms the “raw” data sent by the system into symbolic events. Typically this module computes a numerical quantity and sends symbolic events when the computed value reaches a given threshold. In cognitive vision, this module is usually a video-processing module which transforms images into symbolic data.

Often those scenarios are a combination of logical and temporal constraints. In those cases, a symbolic scenario recognition engine can process the scenarios uniformly as a set of constraints (like the Event Manager of ILOG/JRules, based on a modified RETE algorithm for processing time constraints [Berstel, 2006]) or separate the processing of temporal data from the others, as in [Dousson, 2002]. This article mainly deals with this second approach, where temporal constraints are managed by a constraint graph between relevant time points of the scenarios. There are two approaches for dealing with temporal constraints: STRS, which recognizes scenarios by an analysis of the past [Rota and Thonnat, 2000], and SPRS, which performs an analysis of scenarios that can be recognized in the future [Ghallab, 1996]. Two main problems in the SPRS approach are that scenarios have to be bounded in time in order to avoid never-ending expected scenarios (in practice, when working on real-time systems, it is difficult to exhibit a scenario which cannot be bounded in time), and that the SPRS engine has to maintain all partially recognized scenarios, which possibly leads to the use of a large amount of memory space, in particular because of the combinatorial explosion in the case of “multi-actors”, i.e., scenarios with events involving variables. To partially avoid those drawbacks, the implementation of SPRS algorithms in [Dousson, 2002] introduces a clock and deadlines which are used to garbage collect the pending scenarios, and also introduces variable instantiation (or propagation) mechanisms. On the other hand, the main problem with STRS algorithms is to maintain all previously recognized scenarios. To our knowledge, no work has been published on how long such scenarios should be maintained.

A first attempt to take the benefits of both approaches was made in [Vu et al., 2003]. It consists of a hierarchization of the constraint graph of the scenario. It deals only with graphs where all information about time constraints can be retrieved from a path where temporal instants can be totally ordered. The hierarchy constructs a nested sequence of scenarios containing only two events at a time. The principle of the recognition is, at any instant, to instantiate elementary scenarios and, when an event is integrated in a high-level scenario, to look for previously recognized elementary scenarios. The purpose of this article is to generalize this method; the starting point will be an SPRS approach and the generalization mixes reasoning on past and future. As a byproduct, STRS and SPRS methods appear as two extreme kinds of focalized hierarchical recognition.

The next section presents the SPRS approach used and details some aspects which are relevant to this paper. Section 3 is dedicated to the temporal focalization used to focus the system on uncommon events. As already said, events could be not only basic events coming directly from the supervised system but also aggregated indicators. So the focalization could be used in order to control the computation of such indicators on particular temporal windows and to avoid useless computation. As such indicators could themselves be chronicles, Section 4 presents how the hierarchical recognition deals with common subpatterns of chronicles. Finally, we show that both concepts could be merged and experimentally lead to a good improvement of performance. This will be the object of section 5. We conclude in section 6 with experiments on detecting naive servers in a Reflected Distributed Denial of Service (RDDoS) attack and on the detection of abnormal patterns in cardiac behavior [Carrault et al., 1999].

2 Chronicle Recognition System

Our approach is based on chronicle recognition as proposed in [Dousson, 2002], which falls in the field of SPRS methods. A chronicle is given by a time constraint graph labeled by predicates. This formalism is based on a reified logic (a chronicle is a conjunctive formula). In this article we choose to present the focalization in detail from the time constraint graph point of view. Variables and predicates other than events (persistency, noevent, ...) are not discussed in this paper, but an illustration is provided by the experiments in section 6.

An instance of an event is a pair (e, t) where t is the date of the event and e is its type. When no ambiguity results, we sometimes do not distinguish between an event and its type. Figure 1 shows a chronicle (more precisely a chronicle model) which contains four events: the event e (if instantiated) must occur between 2 and 3 units of time after an instantiation of f, the event g must occur between 0 and 3 units of time after e and between 1 and 4 units of time after e′.

[Figure: time constraint graph with edges f → e labeled [2,3], e → g labeled [0,3], and e′ → g labeled [1,4].]

Figure 1: The chronicle model C

2.1 Recognition Algorithms

Let CRS (Chronicle Recognition System) denote the recognition algorithm. Basically, the mechanism of CRS is, at each incoming event, to try to integrate it into all the pending (and partial) instances of the chronicle model and/or to create a new instance, and to calculate new forthcoming windows for all the forthcoming events of each instance. An instance of a chronicle model is then a partial instantiation of this model, and the forthcoming window fc(e) of a non-instantiated event e is the (extended¹) interval where the occurrence of an event could lead to a recognition².

¹ An extended interval is defined as a union of intervals.

² This does not imply that, for all non-instantiated events e, if (e, t) occurs with t ∈ fc(e) then the instance is recognized. This is a

The forthcoming windows are calculated using constraint propagation. For chronicle C, if the monitored system emits (f, 1)(e, 3)(e′, 5), then, when CRS receives f, it creates the new instance I1 (see figure 2) and updates the forthcoming windows (shown in the nodes). When e occurs, instance I2 is created (instance I1 is not destroyed, waiting for a potential e at 4). Now, when e′ occurs, I3 is created and I1 is destroyed, as no more events e could from now on be integrated into it. Finally, if g arrives at 6, all events of the instance I3 are instantiated and chronicle C is recognized thanks to I3.

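[Figure: three instances with their instantiated dates and forthcoming windows. I1: (f, 1), e ∈ [3,4], e′ ∈ [-1,6], g ∈ [3,7]. I2: (f, 1), (e, 3), e′ ∈ [-1,5], g ∈ [3,6]. I3: (f, 1), (e, 3), (e′, 5), g ∈ [6,6].]

Figure 2: Created instances of the chronicle of figure 1 by the incoming stream (f, 1)(e, 3)(e′, 5).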
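The windows of figure 2 can be reproduced with a standard simple temporal network propagation. The sketch below is an assumption-laden illustration rather than the CRS algorithm: it uses single intervals instead of extended intervals, runs a full Floyd-Warshall pass instead of incremental propagation, does not detect inconsistent networks, and the names `forthcoming_windows` and `ORIGIN` are hypothetical.

```python
from itertools import product

INF = float("inf")
ORIGIN = "origin"  # hypothetical reference time point fixed at date 0


def forthcoming_windows(constraints, instantiated):
    """Return {event: (earliest, latest)} for every event of the chronicle.

    constraints: list of (a, b, lo, hi) meaning lo <= date(b) - date(a) <= hi.
    instantiated: dict mapping already received events to their dates.
    """
    points = {ORIGIN}
    for a, b, _, _ in constraints:
        points.update((a, b))
    points.update(instantiated)
    # Pin every instantiated event to its date relative to the origin.
    pinned = [(ORIGIN, event, date, date) for event, date in instantiated.items()]
    # dist[p, q] is the tightest known upper bound on date(q) - date(p).
    dist = {(p, q): (0 if p == q else INF) for p in points for q in points}
    for a, b, lo, hi in list(constraints) + pinned:
        dist[a, b] = min(dist[a, b], hi)
        dist[b, a] = min(dist[b, a], -lo)
    # All-pairs shortest paths; k is the outermost loop variable.
    for k, i, j in product(points, repeat=3):
        via_k = dist[i, k] + dist[k, j]
        if via_k < dist[i, j]:
            dist[i, j] = via_k
    return {p: (-dist[p, ORIGIN], dist[ORIGIN, p]) for p in points if p != ORIGIN}


# Chronicle C of figure 1.
CHRONICLE_C = [("f", "e", 2, 3), ("e", "g", 0, 3), ("e'", "g", 1, 4)]
```

Calling `forthcoming_windows(CHRONICLE_C, {"f": 1})` corresponds to instance I1; adding `"e": 3` and then `"e'": 5` to the dictionary corresponds to I2 and I3.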

2.2 From Clock to “assert no more” Event.

In the first implementations of chronicle recognition, a clock was introduced in order to discard “impossible” instances when the clock is out of one of the forthcoming windows of an instance (in other words, when one missing event could never be received from the system in the forthcoming window).

In order to take into account some jitter in data transmission, a possible delay δ can be taken into account. This delay is equal to the maximum difference observed at reception between two events sent at the same time by the (possibly distributed) system³. Basically, the event integration algorithm could be written as follows:

integrate((e, t));
setGarbageClock(t − δ)

The main drawback is that it implies that the events arrive roughly in a FIFO manner (the allowed jitter is bounded by δ): so, when the FIFO hypothesis has to be relaxed (and this is often the case when monitored systems are distributed), δ must be increased and the garbage efficiency decreases.

difference between the SPRS mechanism, where integration of events is incremental, and some STRS mechanisms, where integration is made by block in a backward manner.

³ In case of focalization, another side effect of using a clock is the creation of “false” instances. This will be explained in section 6.

In order to avoid this, instead of a clock, we define the assertion assertNoMoreEvent(e, I), where e is an event type and I an extended interval. It specifies to CRS that, from now on, it will not receive events of type e with an occurrence date in I. We do not describe the algorithm here since it is very similar to the CRS one but, intuitively, all the forthcoming windows of e are reduced (fc(e) \ I) and propagated according to the constraint graph of the chronicle. As in the previous CRS, if a forthcoming window becomes empty, the instance is discarded. So, the previous behavior of CRS is given by the following:

integrate((e, t));
∀e, assertNoMoreEvent(e, ]−∞, t − δ])

As we allow I to be an extended interval, more complex garbage management could be easily implemented: it could be different from ]−∞, clock] and we could process assertNoMoreEvent(e, [10, 20] ∪ [30, 40]) and, then, assertNoMoreEvent(e, [0, 10]). This mechanism is implemented in CRS by Streams (one for each event type) which manage the garbage window (gw) when receiving an AssertNoMoreEvent message.

Integrate event (e, t):
- for stream e do:
    - integrate (e, t) in instances involving e.
    - update gw(e)
    - update fc(e) and propagate.
- ∀ stream e′ ≠ e do:
    - update gw(e′)
    - update fc(e′) and propagate.
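The reduction fc(e) \ I at the heart of assertNoMoreEvent amounts to subtracting one extended interval from another. The following sketch is a simplified illustration under stated assumptions: extended intervals are represented as lists of half-open intervals [lo, hi) to avoid tracking open and closed endpoints (the paper uses closed intervals), and the function name `subtract` is hypothetical. An empty result would correspond to an instance being discarded.

```python
def subtract(window, removed):
    """Return window minus removed; both are lists of half-open (lo, hi) pairs."""
    result = []
    for lo, hi in window:
        pieces = [(lo, hi)]
        for rlo, rhi in removed:
            next_pieces = []
            for plo, phi in pieces:
                if rhi <= plo or phi <= rlo:
                    # No overlap: the piece is kept unchanged.
                    next_pieces.append((plo, phi))
                    continue
                if plo < rlo:
                    next_pieces.append((plo, rlo))   # part left of the cut
                if rhi < phi:
                    next_pieces.append((rhi, phi))   # part right of the cut
            pieces = next_pieces
        result.extend(pieces)
    return result
```

For instance, starting from a forthcoming window [0, 50) and asserting that no more events will occur on [10, 20) ∪ [30, 40) and then on [0, 10) shrinks the window in two steps, mirroring the two successive assertions mentioned in the text.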

This point of view slightly changes the manner in which the engine works: the progress of time is no longer driven by the supervised system but by an “internal” control, which is given by the knowledge of the temporal behavior of the system. We will see in the next section how this new feature is used by the temporal focalization.

3 The Temporal Focalization.

3.1 General Description

In some cases, not all events have the same status: event f could be very frequent and event e extremely uncommon. Due to this difference in frequency, the recognition of the chronicle could be impossible in practice: each event f potentially creates a new instance of the chronicle waiting for other events, so if a thousand f arrive between 0 and 1, a thousand instances will be created. As event e is extremely uncommon, most of those instances would finally be destroyed. In CRS, the number of instance creations has a great impact on the performance of the recognition. In order to reduce this number, we focalize on event e in the following manner: when events f arrive, we store them in a collector and we create new instances only when events e occur; then we search the collector for the f events which could be integrated into the instances. Potentially the number of created instances will be the number of e and not the number of f. In order not to be limited to uncommon events, we introduce a level for each event type. The principle of the temporal focalization is then the following:

Begin the integration of events of level n + 1 only when there exists an instance such that all events of level between 1 and n have been integrated.

For the example of figure 2, if e and e′ are uncommon, f is frequent and g is very frequent, we define level(e) = level(e′) = 1, level(f) = 2 and level(g) = 3. The engine will integrate e at t = 3 and e′ at t = 5, then it will search the collector for the events f between t = 0 and t = 1. The collector finds (f, 1) and sends it to the engine, which leads to the creation of the instance I3. Technically, we made the choice of sending f not only to instances waiting for events of level 2 but to all the pending instances. As the number of f events sent from the collector to the engine is small, only a small number of instances are created. In our example, under the FIFO hypothesis, (f, 1) cannot instantiate a new instance as the last received event is e′ at t = 5.
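The store-then-search behavior of the collector can be sketched as below. This is a deliberately minimal illustration: the class name `CollectingStream` and its methods are hypothetical, only event dates are stored, and none of the three temporal windows introduced later (assert no more, exclusion, focus) is modeled.

```python
from bisect import bisect_left, bisect_right, insort


class CollectingStream:
    """Keeps the dates of one frequent event type until the engine asks for them."""

    def __init__(self):
        self.dates = []  # kept sorted

    def store(self, date):
        insort(self.dates, date)

    def extract(self, lo, hi):
        """Return and forget the stored dates lying in the closed window [lo, hi]."""
        start = bisect_left(self.dates, lo)
        stop = bisect_right(self.dates, hi)
        found = self.dates[start:stop]
        del self.dates[start:stop]  # each event is sent to the engine only once
        return found
```

With e integrated at t = 3 and the constraint [2,3] between f and e, the engine would query the stream of f on the window [0, 1]; stored occurrences outside that window stay in the collector and create no instance.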

In addition to the recognition engine, we developed some modules (see figure 3): a module named Event Router, which routes the events coming from the supervised system to either the engine or the collector; a module named Finished Level Detector, which detects when a particular instance of a chronicle has finished integrating all events of level less than n, then looks at the forthcoming windows of events of level n + 1 and sends them to the collector (label ?f ∈ [a, b]); and, finally, a module called No Need Window Calculator, which computes for all current instances the windows where they do not need events of level greater than 2 and sends this information to the collector (in the figure, the engine does not need f on ]−∞, x] ∪ [y, z]); this window is the intersection of the complements of the forthcoming windows of all instances of all chronicle models involving f.

The collector itself is split into Collecting Streams: one for each event type of level ≥ 2 (high-level events); in the previous example, there will be one collecting stream for f and one for g. The collecting stream manages 3 particular temporal windows: i) the assert no more window, which contains all the occurrence dates for which no more events will be received from now on (this window is used to discard pending instances), ii) the exclusion window, which contains all the occurrence dates about which the pending chronicles do not care (this window is used to clean the collector), and iii) the focus window, which contains all the occurrence dates for which an incoming event should be integrated immediately by the engine without being stored by the collector.

[Figure: incoming events (e, d), (f, t), ... enter the Event Router, which feeds either the Recognition Engine (CRS), with its streams, garbage windows gw(e), gw(f) and instances, or the Collector, with one Collecting Stream per high-level event type (f, g) and its windows. The Finished Level Detector sends requests ?f ∈ [a, b] to the collector, and the No Need Window Calculator sends it the windows where f is not needed.]

Figure 3: Architecture of the focalized recognition.

The following subsections detail how these windows are updated.

3.2 The “assert no more” Window.

When the Event Router module receives an event, it executes:

Route event (e, t):
- Send e either to the engine or to the collector.
- ∀ e′ of level 1 do:
    - for stream e′, update and propagate gw(e′), fc(e′)
- ∀ f of level ≥ 2 do:
    - for collecting stream f, update anmw(f)

Thus, we do not update the garbage window for streams (in the engine) of high-level events. This operation will be done by message sending from the collector.

3.3 The Exclusion Window

There are two mechanisms for cleaning the collector: the first one is the emission from the recognition engine of a no need event message (for instance, no need f) on a particular window (which is the intersection of the complements of the forthcoming windows of all instances of all chronicle models involving f); then the collector simply cleans the corresponding events and updates the exclusion window, ew(f), in the collecting stream f:

Exclude window (f, I):
- remove all f in the extended interval I
- ew(f) ← I

The exclusion window expresses the fact that, if an event occurs in this window, the collector does not have to store it, as no chronicle is interested in the event:

Receive Event (f, t):
- if (t ∉ ew(f)) store (f, t)

The second case happens when the recognition engine asks for some events on a particular window: the collector sends the collected events it finds to all instances and cleans them (events in a collector are sent one and only one time to all instances):

Extract event f on I ∈ I:
- ∀(f, t) ∈ I, do:
    - send (f, t) to recognition engine
    - T = I ∩ ]−∞, t] ∩ anmw(f)
    - ew(f) ← ew(f) ∪ T
    - Engine: assertNoMoreEvent(f, ew(f))
- T = I ∩ anmw(f)
- ew(f) ← ew(f) ∪ T
- Engine: assertNoMoreEvent(f, ew(f))

In this algorithm we assume that the collector sends events f ordered by date. The extended interval T = I ∩ ]−∞, t] ∩ anmw(f) expresses the fact that, after receiving f at t, the engine will not receive any more f whose date is in I ∩ ]−∞, t], except if the collector can still receive some f before t.

As the forthcoming window of an event is decreasing in the subset order from ]−∞, +∞[ to ∅, the exclusion window is increasing for the same order. Note that the assert no more window and the exclusion window are different and generally do not include each other.

3.4 The Focus Window

There are (at least) three reasons for introducing the focus window: the main one is that, in addition to the search in the past for some high-level events, we want to predict forthcoming windows for high-level events; the second one is when two events are not totally ordered, for example $e \xrightarrow{[a,b]} f$ (in the constraint graph) with a < 0, b > 0 and level(f) = 2, level(e) = 1; the third one is when $f \xrightarrow{[a,b]} e$, with a, b ≥ 0, with e FIFO but f having a delay greater than b. When (f, t) arrives after e inside fw(f), it must be immediately sent to the engine.

Receive focused event (f, t):
- send (f, t) to recognition engine
- T = fw(f) ∩ ]−∞, t] ∩ anmw(f)
- ew(f) ← ew(f) ∪ T
- Engine: assertNoMoreEvent(f, T)
- fw(f) ← fw(f) ∩ T

Contrary to the assert no more window and the exclusion window, the focus window is neither increasing nor decreasing for the subset order.

3.5 Summary

Finally, the whole algorithm is composed of three blocks:

Main Algorithm:
- ∀(e, t) incoming event:
    - Integrate event (e, t)
    - ∀f of level ≥ 2 do
        - Compute exclusion window of f: $I_f$
        - Collector: Exclude window (f, $I_f$).

Finish Level Detection:
- if (∃ instance of C which finishes integration for level ≤ n)
    - ∀f of level n + 1 do
        - $I_f$ = forthcoming window of f.
        - Collector:
            - fw(f) ← fw(f) ∪ $I_f$
            - Extract Event (f, $I_f$)
            - fw(f) ← fw(f) ∩ ew(f)

Collector integrate (f, t):
- if (t ∈ fw(f))
    Receive focused event (f, t)
  else
    Receive event (f, t)


3.6 Partial order and relative frequencies

In figure 1, if e is much more frequent than f, there is no need to level the chronicle since, if no f has arrived, no e will initiate new instances; thus the number of instances will be the number of f. So, relative to another event e, event f (very frequent) should be leveled only if e ≤ f (in the partial order induced by the constraints), in particular when e and f are not comparable, for example $e \xrightarrow{[-1,2]} f$. We can decompose instances into two categories: the instances where f is before e, for which leveling will be particularly efficient, and the instances where f is after e; in this case the event f should be directly sent to the recognition engine, which is done by the use of the focus window.

Another case where focalization is necessary: suppose a chronicle is made of p + 1 events, an event a and p events $e_1, \ldots, e_p$, where $e_i \le a$. Suppose that the relative frequencies of all events are similar, but the sum of the frequencies of the $e_i$ is much greater than that of a. The previous mechanism could be optimized using a hierarchical structure. Note that this is also necessary when more than one occurrence of f is required for the recognition (see section 6).

4 Hierarchical Recognition of Chronicles.

4.1 Definition

A hierarchical chronicle is a pair $(C, \{h_1, \ldots, h_n\})$ where C is the base chronicle and the $h_i$ are the sub-chronicles; we assume that event types involved in C can take values in the set of sub-chronicle labels $\{h_1, \ldots, h_n\}$. We treat only deterministic hierarchical chronicles⁴, i.e., we do not allow two sub-chronicles having the same label, so, in the following, we make no distinction between the chronicle and its label. Moreover, we suppose that each $h_i$ has a distinguished event $b_i$. The hierarchical chronicle C will be recognized if it is (classically) recognized where integrated events labeled by a sub-chronicle $h_i$ have the date of an integrated event $b_i$ in a recognized instance of $h_i$. In other words, when a sub-chronicle $h_i$ is recognized, it “sends” to all chronicles the event ($h_i$, date of $b_i$).

In figure 4, the hierarchical chronicle H = (C, {h, k}) possesses two sub-chronicles h and k; the chronicle k is formed with two types of events: a basic one, f (which comes from the supervised system), and one representing the sub-chronicle h. The distinguished events are respectively e and h. For a log containing (e, 0)(g, 1)(e, 3), an instance of h is recognized, so an event (h, 3) is sent to the other chronicles. If the engine then receives (f, 4), an instance of k is recognized and produces the event (k, 3). So, for the log (e, 0)(g, 1)(e, 3)(e, 4)(g, 4)(f, 4)(e, 6), C is recognized with the events (h, 3)(k, 3)(e, 4)(h, 6) (the event (h, 6) is produced by (e, 3)(g, 4)(e, 6)).
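The example can be replayed with a brute-force sketch in Python. The constraint values used for h (e −[1,1]→ g and e −[2,3]→ e) and for k (h −[1,2]→ f) are assumptions read off the labels of figure 4, and `recognize` is a hypothetical helper, not the CRS engine; it only illustrates how a recognized sub-chronicle feeds an event, dated by its distinguished event, back into the log.

```python
from itertools import product

def recognize(nodes, constraints, distinguished, log):
    """Return the sorted dates of the distinguished node over all instances.

    nodes: list of event types, one per chronicle node.
    constraints: {(i, j): (lo, hi)} meaning lo <= t_j - t_i <= hi.
    distinguished: index of the distinguished node.
    log: list of (event_type, date) pairs.
    """
    candidates = [[t for typ, t in log if typ == n] for n in nodes]
    dates = set()
    for combo in product(*candidates):  # exhaustive search, fine for tiny logs
        if all(lo <= combo[j] - combo[i] <= hi
               for (i, j), (lo, hi) in constraints.items()):
            dates.add(combo[distinguished])
    return sorted(dates)

log = [("e", 0), ("g", 1), ("e", 3), ("e", 4), ("g", 4), ("f", 4), ("e", 6)]

# Assumed sub-chronicle h: e -[1,1]-> g, e -[2,3]-> e, distinguished = last e.
h_dates = recognize(["e", "g", "e"], {(0, 1): (1, 1), (0, 2): (2, 3)}, 2, log)

# Each recognized instance of h "sends" the event (h, date of b).
log_with_h = log + [("h", t) for t in h_dates]

# Assumed sub-chronicle k: h -[1,2]-> f, distinguished = h.
k_dates = recognize(["h", "f"], {(0, 1): (1, 2)}, 0, log_with_h)

print(h_dates, k_dates)
```

Under these assumed constraints the sketch yields the events (h, 3), (h, 6) and (k, 3), in agreement with the log discussed above.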

Let C be a chronicle with constraint graph G and with an event h, where h is a sub-chronicle with distinguished element b and with constraint graph V. Expanding the chronicle

4. There is no technical difficulty in considering non-deterministic hierarchical chronicles. The main difference is that expansion (see below) leads to more than one chronicle. This is a possible way to introduce disjunction in the chronicle formalism.

Figure 4: The hierarchical chronicle (C, h, k).

Figure 5: The expansion of the hierarchical chronicle C.

C (or its graph) consists in replacing the node h ∈ G by the graph V, specifying that the constraints between b and G\V are the constraints of h ∈ G. Let the relation hi → hj be defined if hj contains the event type hi. The hierarchical chronicle is structurally consistent if no sub-chronicle depends on itself, i.e. ∀i, hi ↛∗ hi, where →∗ denotes the transitive closure of →. In this case the expansion of a hierarchical chronicle is well defined (the graph of figure 5 is the expansion of (C, h, k)). A hierarchical chronicle C is consistent if there exists a set of events such that C is recognized. It is straightforward that a (structurally consistent) hierarchical chronicle is consistent iff the constraint graph obtained when expanding the sub-chronicles is consistent.
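Structural consistency amounts to the absence of cycles in the dependency relation between sub-chronicles. A minimal Python sketch of such a check follows; the dictionary encoding `contains` and the function name are assumptions made for illustration, and the mapping is stored in the reverse direction of → (from a chronicle to the sub-chronicles it uses), which does not change the cycles.

```python
def is_structurally_consistent(contains):
    """contains[x] is the set of sub-chronicle labels used as event types in x.

    Returns True when no chronicle (transitively) depends on itself.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}

    def visit(x):
        color[x] = GREY
        for y in contains.get(x, ()):
            c = color.get(y, WHITE)
            if c == GREY:              # y is already on the path: cycle found
                return False
            if c == WHITE and not visit(y):
                return False
        color[x] = BLACK
        return True

    return all(visit(x) for x in list(contains)
               if color.get(x, WHITE) == WHITE)

# Dependencies of the chronicle of figure 4: C uses h and k, k uses h.
figure4 = {"C": {"h", "k"}, "k": {"h"}, "h": set()}
# A hypothetical ill-formed hierarchy where h and k use each other.
cyclic = {"h": {"k"}, "k": {"h"}}

print(is_structurally_consistent(figure4), is_structurally_consistent(cyclic))
```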

4.2 Condition for Hierarchizing a Chronicle

Hierarchical chronicles can arise in two ways: first, the chronicle may be initially specified in a hierarchical manner, for example if the architecture of the system is itself hierarchical; second, starting from a flat chronicle, we may identify identical patterns inside this chronicle. The next proposition gives a necessary and sufficient condition for the hierarchization of a (temporal) pattern and shows that care is needed in pattern factorization: two subgraphs could be identical (with the same constraints), yet one of them can satisfy the condition and the other not.

Proposition 4.1 Let G be a minimal constraint graph and U ⊆ G a subgraph. Let a ∈ U be a (distinguished) node and H the hierarchical chronicle defined by G′ = (G\U) ∪ {h} and h = U, where h ∈ G′ has the same constraints as node a. Then H and G recognize the same events iff

∀b ∈ U, ∀c ∈ G\U, Dbc = Dba + Dac    (1)

where Dxy are the time constraints between G’s nodes.
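Condition (1) can be checked mechanically once the minimal graph is available. The sketch below assumes that the constraints are stored as a dictionary of intervals `D[(x, y)] = (lo, hi)`, that `D[(a, a)] = (0, 0)`, and that the graph passed in is already minimal (the minimization itself is not shown); the function names are hypothetical.

```python
def add(i1, i2):
    """Sum of two intervals: [l1, h1] + [l2, h2] = [l1 + l2, h1 + h2]."""
    return (i1[0] + i2[0], i1[1] + i2[1])

def can_hierarchize(D, U, outside, a):
    """Check condition (1): D[b, c] == D[b, a] + D[a, c] for b in U, c outside U."""
    return all(D[(b, c)] == add(D[(b, a)], D[(a, c)])
               for b in U for c in outside)

# Example where every path from U to the outside goes through a.
D_ok = {("a", "a"): (0, 0), ("b", "a"): (1, 1),
        ("a", "c"): (2, 3), ("b", "c"): (3, 4)}

# Example with a direct constraint b -> c tighter than the sum through a.
D_ko = {("a", "a"): (0, 0), ("b", "a"): (0, 1),
        ("a", "c"): (0, 1), ("b", "c"): (1, 1)}

print(can_hierarchize(D_ok, ["a", "b"], ["c"], "a"))
print(can_hierarchize(D_ko, ["a", "b"], ["c"], "a"))
```

In the second example the subgraph {a, b} cannot be replaced by a sub-chronicle with distinguished node a, because the constraint between b and c would be lost.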

Proof: It suffices to prove that the minimal graph of the expansion graph, GH, of H coincides with G. This is immediate by noting that when computing the minimal graph


Figure 6: Streams for the hierarchical chronicle of figure 4; bold arrows correspond to the feeding of event streams (by incoming events or by events produced by recognized chronicles) and thin arrows correspond to the integration of events by the recognition engine.

of GH, the constraints of GH are not reduced, all added constraints are between U and (G\U), and those constraints are redundant.

4.3 Recognitions

In the architecture of the hierarchical engine, instead of using one recognition engine per sub-chronicle, we put them all together in the same engine, and the streams accept events coming both from the system and from recognized instances of sub-chronicles. For example, for the chronicle of figure 4, the stream corresponding to the high-level event type h (see figure 6) accepts events coming from instances of chronicle model h (bold arrow), and its events are integrated into instances of chronicle models k and C (thin arrows).

5 Focalized Hierarchical Recognition of Chronicles.

In this section, we present how focalization and hierarchization can be mixed. In the architecture of the collector, we need to add a module which transforms the exclusion and focus windows of high-level events into those of the different collecting streams. The assert no more window should also be adapted.

Updating the assert no more window

When receiving an event, we update the garbage window of all streams except those corresponding to high-level events, and we update the assert no more window for collecting streams. For example, when receiving f, we update stream e and collecting streams e, g and f.

Updating the exclusion window

When the collector receives a message sent by the recognition engine, saying it does not need the high-level event f on the extended interval, the collector needs to update the exclusion window of all events in the chronicle f, but it also needs to take into account the exclusion windows of all other high-level events. So, the exclusion window is given by:

$$ew(e) = \bigcap_{e \to f_i} ew(f_i) / D_{e,b_i}$$

where $D_{e,b_i}$ is the time constraint from e to the distinguished node $b_i$ of $f_i$, and where [u, v]/[a, b] = [u − a, v − b] if u − a ≤ v − b and [u, v]/[a, b] = ∅ otherwise, with the natural extension to the set of extended intervals.
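The interval operator used in this formula can be sketched in a few lines of Python. The sketch is a simplification: it handles single intervals only (not extended intervals, i.e. sets of intervals), represents ∅ by `None`, and the function names are hypothetical.

```python
def divide(w, d):
    """[u, v] / [a, b] = [u - a, v - b] if u - a <= v - b, else None (empty)."""
    (u, v), (a, b) = w, d
    lo, hi = u - a, v - b
    return (lo, hi) if lo <= hi else None

def intersect(w1, w2):
    """Intersection of two intervals; None stands for the empty set."""
    if w1 is None or w2 is None:
        return None
    lo, hi = max(w1[0], w2[0]), min(w1[1], w2[1])
    return (lo, hi) if lo <= hi else None

def exclusion_window(high_level):
    """high_level: list of (ew(f_i), D_{e,b_i}) pairs, one per f_i containing e."""
    result = (float("-inf"), float("inf"))
    for ew_f, d in high_level:
        result = intersect(result, divide(ew_f, d))
    return result

# Event e appears in two high-level events with illustrative windows.
ew_e = exclusion_window([((0, 10), (1, 2)), ((3, 12), (0, 1))])
print(ew_e)
```

Here the two quotients are [−1, 8] and [3, 11], so the exclusion window of e is their intersection [3, 8].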

Figure 7: Communications between streams, collecting streams and instances.

Updating the focus window

The focus window of the collecting stream of an event e involved in high-level events f1, . . . , fn is given by

$$\bigcup_i \left( fw(f_i) + D_{b_i,e} \right)$$

where $D_{b_i,e}$ is the time constraint from the distinguished node $b_i$ of $f_i$ to e.
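As an illustration, the union can be computed by shifting each window and merging overlapping results. The sketch assumes that + is the usual interval sum, [u, v] + [a, b] = [u + a, v + b], which the text does not spell out, and that each focus window is a single interval; `focus_window` is a hypothetical name.

```python
def shift(w, d):
    """Interval sum [u, v] + [a, b] = [u + a, v + b] (assumed meaning of +)."""
    return (w[0] + d[0], w[1] + d[1])

def focus_window(high_level):
    """high_level: list of (fw(f_i), D_{b_i,e}) pairs; returns a merged union."""
    shifted = sorted(shift(fw, d) for fw, d in high_level)
    merged = []
    for lo, hi in shifted:
        if merged and lo <= merged[-1][1]:   # overlaps the previous interval
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

fw_e = focus_window([((0, 2), (1, 1)), ((2, 5), (0, 1)), ((10, 11), (0, 0))])
print(fw_e)
```

The result is a list of disjoint intervals, which corresponds to an extended interval in the terminology used above.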

Recognition

From the structural point of view, recognition follows the same principle as hierarchical recognition but with communications (in both directions) between collecting streams and instances. For our example in figure 4, with level(h) = 2 and level(k) = 3, communications are described in figure 7: events of type e are sent to stream e (and then to C) and to collecting stream e (bold arrows). Collecting streams e and g send the required events to instances of h (thin arrows). If h is recognized, an event h is sent to stream h (and then to C) and to collecting stream h, and so on.

Remark: An example of the gain of focalization over the technique presented in [Vu et al., 2003] is that, for subgraphs with [0, 0] constraints (co-occurrence), we do not have to explore combinations of partial instantiations but only store events and wait for a future event in order to extract from the collector the events with the “good” parameters; thus the combinatorial explosion of elementary scenarios is avoided.

6 Experimentation

To summarize, the first example shows a major performance gain mainly obtained by the focalization technique, while the gain for the second example is principally due to the hierarchization. The experiments were run under Mac OS X on a 2*1GHz PPC G4 with 1Gb SDRAM. The algorithms are implemented in Java (JRE 1.4).

6.1 Reflected Distributed DoS Attack

In an RDDoS attack, machines using the spoofed IP address of the victim send SYN messages to naive servers. Then, those


Figure 8: Comparison of process time for the RDDoS log. [Plot: processing time (min) versus number of L events in SynLow (1 L to 5 L) for CRS + clock, CRS + gw, fh-CRS + clock, fh-CRS + gw and opt. CRS + gw.]

servers reply by sending SYN/ACK messages to the victim, generating massive flooding. A characteristic of the attack is that the SYN traffic to a naive server is low, persistent and, taken alone, does not trigger an alarm. We want to identify the naive servers. In our experiment, information on the global traffic is sent up by some core routers. Preprocessing is done by a dedicated numerical algorithm which computes the throughput between pairs of IP addresses and sends alarms when this throughput is greater than one of two thresholds: a low one (L events) and a high one (H events).

So CRS receives two kinds of events: H[ip dest] and L[ip src, ip dest], where the variables are the IP addresses of the source and the destination of the throughput. The leveled hierarchical chronicle is (the “flat” one is obtained using expansion):

Chronicle RDDoS[ip naif]:
    (level 2) SynLow[ip dest, ip naif], ts
    (level 1) H[ip dest], ta
    (level 1) H[ip dest], tb
    ts −[0,120]→ ta −[0,60]→ tb

Chronicle SynLow[ip src, ip dest]
    L[ip src, ip dest], t1
    L[ip src, ip dest], t2
    L[ip src, ip dest], t3 (distinguished event)
    t1 −[0,120]→ t3, t1 < t2 < t3

We do not discuss here the choice of thresholds nor the relevance of the chronicle; our aim is to present the performance of focalization in the face of a huge amount of data. The log contains 240000 events (L and H), its duration is 6 min, and the frequencies are about 660 L/sec and 3 H/sec. Due to a lack of synchronization of the routers, the delay is set to 60 s. In figure 8, we compare the processing time for 5 engines: on the expanded (flat) chronicle, the previous CRS with clock and CRS with garbage window, and on the corresponding hierarchical chronicle, the focalized hierarchical (fh) engine. We also compare with an optimal log where all H events are sent before L events (only possible on a fixed log). In order to vary the complexity, we change the number of L events in the sub-chronicle SynLow (from 1 L to 5 L).

For the chronicle with 3 L events, results are summarized in the following table:

Engine         Processing time   Created instances   Maximal collector’s size
CRS+c          17 min.           236 600             n.a.
CRS+gw         14 min.           236 600             n.a.
fh-CRS+c       3 min.            3550                4800
fh-CRS+gw      7 s.              1840                1330
opt. CRS+gw    34 s.             1850                n.a.

So the number of created instances is considerably reduced when using focalization. The difference between the number of instances when using the clock (3550) and when using the garbage window (1840) can be explained on this example: for the leveled chronicle C of figure 1, when f arrives at t = 1, we store it and the clock is set to 1; but when (e, 3) arrives, the clock cannot be set to 3, otherwise all current instances would be discarded; so, even under the FIFO hypothesis, it is necessary to set the delay to 3 (the upper bound of the constraint between f and e). Because of this artifact, a (possibly large) number of “false” instances are uselessly created. The introduction of this delay also has an impact on the collector’s size.

6.2 Mutualisation of Cardiac Patterns

Our second experiment concerns the identification of cardiac patterns [Carrault et al., 1999] (see footnote 5). The log contains 3600 events of two types: p wave[?x] and qrs[?x] with ?x ∈ {normal, abnormal}. These two events are extracted from electrocardiograms (ECG) and represent two characteristic events in the heart cycle: the arrival of a P wave and of a QRS complex. The leveled hierarchical chronicle we use is:

Chronicle test[]:
    (level 1) qrs[abnormal], t1
    (level 2) pattern2[normal], t2
    (level 2) pattern1[normal], t3
    (level 2) pattern1[abnormal], t4
    t1 < t2 < t3 −[0,600]→ t4
    t1 −[cycle k]→ t4

Chronicle pattern1[?x]
    noevent (qrs[*], ts, t1)
    p wave[], t1 (distinguished event)
    ts −[0,600]→ t1

where constraint “cycle k” is equal to [0, k ∗ 2000]. In the

5. We wish to thank the authors for providing the cardiac chronicles and ECG data.


Figure 9: Comparison of process time for the cardiac log. [Plot: processing time (sec, left scale) and number of created instances (right scale) for cycle1 to cycle7, comparing CRS + gw and fh-CRS + gw; the recognition counts annotated on the plot are 267, 2266, 7555, 20212, 41167, 69527 and 108469.]

sub-chronicle pattern1, we use the predicate noevent, which means that no qrs event may occur between ts and t1. The sub-chronicle pattern2 is almost the same as pattern1.

In this experiment the frequencies of the events are similar; indeed, the log contains 1987 qrs and 1550 p wave events. So, when the number of considered cycles k increases, the combinatorics greatly increase and the gain obtained by the hierarchization of common patterns becomes significant. The results are presented in figure 9; triangles represent the number of created instances (right-hand scale). As we said in the introduction, we can see a direct correlation between this number and the performance of the recognition.

7 Conclusion

This paper presents two improvements of chronicle recognition. The first one is the focalization on particular events, which allows the system to reason on the past and on the future in a homogeneous manner. The second concerns hierarchical recognition based on subpatterns. We also showed that mixing both improvements increases the efficiency and also fills the gap between the SPRS and STRS approaches, which are both completely covered.

In practice, our approach is sufficiently adaptive to allow fine-tuning of the recognition system. For instance, the order of event integration can differ from the arrival order (for instance, to take into account the relative frequencies). Moreover, leveled events make it possible to postpone some expensive numerical computation needed to generate events (by setting a high level for these kinds of events) and so avoid useless computation.

Future work will take two directions. The first one is to define a more flexible way to factorize common patterns, since condition 1 (section 4.2) is too restrictive in many cases. As dealing with event frequency substantially increases the efficiency of CRS, the second direction will focus on the online analysis of event frequency in order to adapt the hierarchical focalization dynamically.

References

[Berstel, 2006] Bruno Berstel. Extending the RETE algorithm for event management. Ninth International Symposium on Temporal Representation and Reasoning (TIME’02), pages 49–51, July 2006. IEEE Transactions.

[Carrault et al., 1999] G. Carrault, M.O. Cordier, R. Quiniou, M. Garreau, J.J. Bellanger, and A. Bardou. A model-based approach for learning to identify cardiac arrhythmias. Artificial Intelligence in Medicine and Medical Decision Making, 1620:165–174, 1999. W. Horn et al. editors.

[Cordier and Dousson, 2000] M. O. Cordier and C. Dousson. Alarm Driven Monitoring Based on Chronicles. In Proc. of the 4th Symposium on Fault Detection Supervision and Safety for Technical Processes (SAFEPROCESS), pages 286–291, Budapest, Hungary, June 2000. IFAC, A.M. Eldemayer.

[Dousson, 2002] C. Dousson. Extending and unifying chronicle representation with event counters. In Proc. of the 15th ECAI, pages 257–261, Lyon, France, July 2002. F. van Harmelen, IOS Press.

[Fessant et al., 2004] F. Fessant, C. Dousson, and F. Clerot. Mining of a telecommunication alarm log to improve the discovery of frequent patterns. 4th Industrial Conference on Data Mining (ICDM’04), July 2004.

[Ghallab, 1996] M. Ghallab. On chronicles: Representation, on-line recognition and learning. Proc. of the 5th International Conference on Principles of Knowledge Representation and Reasoning (KR-96), pages 597–606, November 1996. Morgan Kaufmann.

[Guerraz and Dousson, 2004] B. Guerraz and C. Dousson. Chronicles Construction Starting from the Fault Model of the System to Diagnose. 15th International Workshop on Principles of Diagnosis (DX’04), pages 51–56, June 2004.

[Rota and Thonnat, 2000] Nathanael Rota and Monique Thonnat. Activity recognition from video sequences using declarative models. 14th ECAI, pages 673–677, August 2000. W. Horn (ed.), IOS Press.

[Vautier et al., 2005] Alexandre Vautier, Marie-Odile Cordier, and Rene Quiniou. An inductive database for mining temporal patterns in event sequences. Workshop mining spatio-temporal data (in PKDD05), October 2005.

[Vu et al., 2003] Van-Thin Vu, Francois Bremond, and Monique Thonnat. Automatic video interpretation: A novel algorithm for temporal scenario recognition. 18th IJCAI, August 2003.


Model-based Test Generation for Embedded Software

M. Esser1, P. Struss1,2

1 Technische Universität München, Boltzmannstr. 3, D-85748 Garching, Germany
2 OCC’M Software, Gleissentalstr. 22, D-82041 Deisenhofen, Germany

{esser, struss}@in.tum.de, struss@occm.de

Abstract

Testing embedded software systems on the control units of vehicles is a safety-relevant task, and developing the test suites for performing the tests on test benches is time-consuming. We present the foundations and results of a case study to automate the generation of tests for control software of vehicle control units based on a specification of requirements in terms of finite state machines. This case study builds upon our previous work on generation of tests for physical systems based on relational behavior models. In order to apply the respective algorithms, the finite state machine representation is transformed into a relational model. We present the transformation, the application of the test generation algorithm to a real example, and discuss the results and some specific challenges regarding software testing.

1 Introduction

Over the last decade or so, cars have become a kind of mobile software platform. There are tens of processors (Electronic Control Units, ECU) on board a vehicle; they communicate with each other via several bus systems, and software has a major influence on the performance and safety of a vehicle. The software embedded in the mechanical, electrical, pneumatic, and hydraulic car subsystems is becoming increasingly complex, and it comes in many variants, reflecting the context of different types of vehicles, the manufacturer-specific physical realization, versions over time etc. Testing such embedded software is becoming increasingly challenging and has been moving away from test drives under various conditions to automated tests performed on test benches which can partly or totally simulate the car as a physical system.

But for the reasons stated above, namely the complexity of the software and its variation, generating the test suites becomes demanding and time-consuming and calls for computer support. Automating the generation of such tests based on a specification of the desired behavior of the software together with the physical system promises benefits regarding both the required efforts and the completeness of the result.

In [Struss 94, 94a], we presented the theoretical and technical foundations for automated test generation for physical systems based on models of their (nominal and faulty) behavior. Such behaviors are represented as (finite) relations over system variables which characterize the possible states under different modes of behavior. On this basis, tests can be computed as sets of stimuli that trigger disjoint projections of the behavior relations to the space of observables.

An extension of this approach to cover also software would be highly beneficial, because it would provide a coherent solution to testing both physical systems and their embedded software. More concretely, the software test could start from a specification of the intended behavior of the entire system (including physical components and software), and also the tests could reflect the particular nature of the embedded software, namely using stimuli and observations of the physical system rather than directly of the software system.

The case study described in this paper concerns a real-life example (the measurement and computation of the fuel level in a vehicle tank) based on the requirement specification document of a car manufacturer.

We continue by summarizing the basis for our relation-based implementation of test generation. In order to extend it to software, the requirement specification has to be turned into a relational representation. In the respective document, the skeleton of this specification is provided in a state-chart manner. Therefore, section 3 of this paper proposes a behavior specification as a special finite state machine, and section 4 presents the transformation into a relational representation.

A major challenge in the application of the test generation algorithm to software is to provide relevant and appropriate fault models against which the software should be tested (section 5). The final sections present results of the case study and discuss problems and insights.

2 The Background: Model-based Test Generation

In the most general way, testing aims at finding out which hypothesis out of a set H is correct (if any) by stimulating a system in such a way that the available observations of the system responses to the stimuli refute all but one hypothesis (or even all of them). This is captured by the following definition.

Definition (Discriminating Test Input)

Let TI = {ti} be the set of possible test inputs (stimuli), OBS = {obs} the set of possible observations (system responses), and H = {hi} a set of hypotheses. ti ∈ TI is called a definitely discriminating test input for H if
(i) ∀ hi ∈ H ∃ obs ∈ OBS: ti ∧ hi ∧ obs ⊬ ⊥, and
(ii) ∀ hi ∈ H ∀ obs ∈ OBS

if ti ∧ hi ∧ obs ⊬ ⊥ then ∀ hj ≠ hi: ti ∧ hj ∧ obs ⊢ ⊥.

ti is a possibly discriminating test input if
(ii´) ∀ hi ∈ H ∃ obs ∈ OBS such that ti ∧ hi ∧ obs ⊬ ⊥ and ∀ hj ≠ hi: ti ∧ hj ∧ obs ⊢ ⊥.

In this definition, condition (i) expresses that there exists an observable system response for each hypothesis under the test input. It also implies that test inputs are consistent with all hypotheses, i.e. we are able to apply the stimulus, because it is causally independent of the hypotheses. Condition (ii) formulates the requirement that the resulting observation guarantees that at most one hypothesis will not be refuted, while (ii´) states that each hypothesis may generate an observation that refutes all others.

While testing for fault identification has to discriminate between each single pair of hypotheses (if possible), testing for confirming (or refuting) a particular hypothesis h0 requires only discrimination between h0 and any other hypothesis. Usually, one stimulus is not enough to perform the discrimination task which motivates the following definition.

Definition (Confirming Test Input Set)

{tik} = TI´ ⊂ TI is called a discriminating test input set for H = {hi} and h0 ∈ H if ∀ hj with h0 ≠ hj ∃ tik ∈ TI´ such that tik is a discriminating test input for {h0, hj}.

It is called definitely confirming if all tik are definitely discriminating, and possibly confirming otherwise. It is called minimal if it has no proper subset TI´´⊂ TI´ which is discriminating.

Remark: Refutation of all hypotheses hj ≠ h0 implies h0 only if we assume that the set H is complete, i.e. ∨i hi holds.

Such logical characterizations (see also [McIlraith-Reiter 92]) are too general to serve as a basis for the development of an appropriate representation and algorithms for test generation. Here (in test generation for physical systems), the hypotheses correspond to assumptions about the correct or possible faulty behavior of the system to be tested. They are usually given by equations and implemented by constraints, and test inputs and observations can be described as value assignments to system variables.

The system behavior is assumed to be characterized by a vector vS = (v1, v2, v3, … , vn) of system variables with domains DOM(vS) = DOM(v1) × DOM(v2) × … × DOM(vn). Then a hypothesis hi ∈ H is given as a relation Ri ⊆ DOM(vS).

For conformity testing, h0 is given by R0 = ROK, the model of correct behavior. Observations are value assignments to a subvector of the variables, vobs, and the stimuli are likewise described by assigning values to a vector vcause of susceptible (“causal” or input) variables. We make the reasonable assumption that we always know the applied stimulus, which means that the causal variables are a subvector of the observable ones: vcause ⊆ vobs.

Example

To illustrate and motivate the following formal treatment, let us consider a trivial task: generate tests that discriminate an open from a correct resistor. The obvious proposed test is to apply a voltage drop which is guaranteed to be non-zero and check whether or not we observe a non-zero current. Fig. 1 displays descriptions of the behavior under each mode in qualitative terms for the stimulus (a pair of voltages) and the observable current. They capture the information that only applying (qualitatively) different voltages guarantees a non-zero flow in the ok mode which can be distinguished from a zero current enforced by an open resistor and, hence specifies a definitely discriminating test input.

As suggested by the example, the basic idea underlying model-based test generation ([Struss 94]) is that the construction of test inputs is done by computing them from the observable differences of the relations that represent the various hypotheses. Fig. 2 illustrates this. Firstly, for testing, only the observables matter. Accordingly, Fig. 2 presents only the projections, pobs(Ri), pobs(Rj), of two relations, R1 and R2, (possibly defined over a large set of variables) to the observable variables. The vertical axis represents the causal variables, whereas the horizontal axis shows the other observable variables (representing the observable system response).

To construct a (definitely) discriminating test input, we have to avoid stimuli that can lead to the same observable system response for both relations, i.e. stimuli that may lead to an observation in the intersection pobs(Ri) ∩ pobs(Rj) shaded in Fig. 2. In our example, this intersection contains only current=0. These test inputs are computed by projecting the intersection to the causal variables:

pcause(pobs(Ri) ∩ pobs(Rj)).

This yields the pairs with equal voltage in the example.

The complement of this (i.e. the pairs of unequal voltages) is the complete set of all test inputs that are guaranteed to

Figure 1: Behavior description of a correct and an open resistor


produce different system responses under the two hypotheses:

DTIij = DOM(vcause) \ pcause(pobs(Ri) ∩ pobs(Rj)) .

Lemma 1
If hi=Ri, hj=Rj, TI=DOM(vcause), and OBS=DOM(vobs), then DTIij is the set of definitely discriminating test inputs for {hi, hj}.

Please note that we assume that the projections of Ri and Rj cover the entire domain of the causal variables, which corresponds to condition (i) in the definition of the test input.
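For finite relations, DTIij can be computed directly with set operations. The following Python sketch replays the resistor example under simplifying assumptions that are not taken from Fig. 1: two qualitative voltage levels (0 and 1), the sign of the current as the only response variable, and all variables observable, so that the projection pobs is the identity. Relations are plain sets of tuples rather than the decision-diagram representation of the actual implementation.

```python
from itertools import product

V = (0, 1)  # assumed qualitative voltage levels

def sign(x):
    return (x > 0) - (x < 0)

# Tuples are (v1, v2, current sign); (v1, v2) are the causal variables.
R_ok = {(v1, v2, sign(v1 - v2)) for v1, v2 in product(V, V)}
R_open = {(v1, v2, 0) for v1, v2 in product(V, V)}

def p_cause(rel):
    """Projection of a relation onto the causal variables (v1, v2)."""
    return {(v1, v2) for v1, v2, _ in rel}

def dti(r_i, r_j, dom_cause):
    """DTI_ij = DOM(v_cause) minus p_cause(p_obs(R_i) & p_obs(R_j))."""
    return dom_cause - p_cause(r_i & r_j)

DOM_CAUSE = set(product(V, V))
DTI = dti(R_ok, R_open, DOM_CAUSE)
print(sorted(DTI))
```

Only the pairs of unequal voltages remain, which matches the test proposed in the example.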

We only mention the fact, that, when applying tests in practice, one may have to avoid certain stimuli because they carry the risk of damaging or destroying the system or to create catastrophic effects as long as certain faults have not been ruled out. In this case, the admissible test inputs are given by some set Radm ⊆ DOM(vcause), and we obtain DTIadm, ij = Radm \ pcause(pobs(Ri) ∩ pobs(Rj)) .

In a similar way as for DTIij, we can compute the set of test inputs that are guaranteed to create indistinguishable observable responses under both hypotheses, i.e. they cannot produce observations in the difference of the relations:

(pobs(Ri) \ pobs(Rj)) ∪ (pobs(Rj) \ pobs(Ri)).

Then the non-discriminating test inputs are

NTIij = DOM(vcause)\ pcause((pobs(Rj)\ pobs(Ri)) ∪ (pobs(Ri)\ pobs(Rj)))

All other test inputs may or may not lead to discrimination.

Lemma 2
The set of all possibly discriminating test inputs for a pair of hypotheses {hi, hj} is given by

PTIij = DOM(vcause)\ (NTIij ∪ DTIij ) .
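The three sets partition the domain of the causal variables, which can be seen on a small example. In this Python sketch the relations are hypothetical and already projected to the observables; each tuple is a pair (stimulus, response), and `classify` is an illustrative name.

```python
def p_cause(rel):
    """Projection onto the causal variable (first component of each pair)."""
    return {cause for cause, _ in rel}

def classify(r_i, r_j, dom_cause):
    """Return (DTI, NTI, PTI) for two observable relations r_i and r_j."""
    dti = dom_cause - p_cause(r_i & r_j)
    nti = dom_cause - p_cause((r_i - r_j) | (r_j - r_i))
    pti = dom_cause - (nti | dti)
    return dti, nti, pti

# Hypothetical relations: stimulus "a" always gives the same response,
# "b" sometimes does, and "c" never does.
R1 = {("a", 0), ("b", 0), ("b", 1), ("c", 1)}
R2 = {("a", 0), ("b", 0), ("b", 2), ("c", 2)}

DTI, NTI, PTI = classify(R1, R2, {"a", "b", "c"})
print(DTI, NTI, PTI)
```

Stimulus c is definitely discriminating, a is non-discriminating, and b is only possibly discriminating because the shared response 0 may occur under both hypotheses.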

The sets DTIij for all pairs hi, hj provide the space for constructing (minimal) discriminating test input sets.

Lemma 3
The (minimal) hitting sets of the set {DTI0j} are the (minimal) definitely confirming test input sets for H, h0.

A hitting set of a set of sets {Ai} is defined by having a non-empty intersection with each Ai. (Please note that Lemma 3 only serves to characterize all discriminating test input sets. Since we need only one test input to perform the test, we are not bothered by the complexity of computing all hitting sets.)
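For small examples, the minimal hitting sets can be enumerated by brute force. The Python sketch below is exponential in the number of test inputs and is meant only to illustrate the definition; the sets DTI0j and the test input names t1, t2, t3 are hypothetical.

```python
from itertools import combinations

def minimal_hitting_sets(sets):
    """Enumerate all minimal hitting sets of a list of non-empty sets."""
    universe = sorted(set().union(*sets))
    found = []
    for size in range(1, len(universe) + 1):      # smallest candidates first
        for cand in combinations(universe, size):
            c = set(cand)
            hits_all = all(c & s for s in sets)
            # A superset of an already found hitting set is not minimal.
            if hits_all and not any(f <= c for f in found):
                found.append(c)
    return found

# Hypothetical sets DTI_01, DTI_02, DTI_03 of definitely discriminating inputs.
DTI_0 = [{"t1", "t2"}, {"t2", "t3"}, {"t3"}]
hitting = minimal_hitting_sets(DTI_0)
print(hitting)
```

In this example two tests suffice to discriminate h0 from three other hypotheses, in line with the remark that the number of tests can be smaller than the number of hypotheses.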

This way, the number of tests constructed can be less than the number of hypotheses different from h0. If the tests have a fixed cost associated, then the cheapest test set can be found among the minimal sets. However, it is worth noting that the test input sets are the minimal ones that guarantee the discrimination of h0 from the hypotheses in H. In practice, only a subset of the tests may have to be executed, because some of them refute more hypotheses than guaranteed (because they are a possibly discriminating test for some other pair of hypotheses) and render other tests unnecessary.

The algorithm has been implemented based on software components of OCC’M’s RAZ’R ([OCC’M 05]) which provide a representation and operations of relations as ordered multiple decision diagrams (OMDD). The input is given by constraint models of correct and faulty behavior of components taken from a library which are aggregated according to a structural description.

Finally, we mention that probabilities (of hypotheses and observations) can be used to optimize test sets ([Struss 94a], [Vatcheva-de Jong-Mars 02]).

3 State Charts for Specification of Software Requirements

State charts and finite state machines (FSM) are frequently used in specifications of software requirements. Figure 3 shows a FSM extracted from a requirement specification produced by an automotive manufacturer. The machine describes a process to detect refueling of a passenger car: if the car stops for more than 8 seconds and if a remarkably higher tank filling is detected then the software sets the output flag RFD (ReFilling Detected) to true. Otherwise RFD is always false.

Let us define the type of FSM used in a more formal way: an automaton ma = (E, (I,O,L), (S,A), T, s0, l0) is described by

• the set E of events {e1, …, enE},
• the ordered set I of input variables {i1, …, inI},
• the ordered set O of output variables {o1, …, onO},
• the ordered set L of local variables {l1, …, lnL},
• the set S of control states {s1, …, snS},
• the set A of state expressions {a1, …, anS} defining a relation δa,i ⊂ dom(I) x dom(L) x dom(O) x dom(L) for each state si,
• the set T of transitions {T1, …, TnT} with Ti ⊂ S x P(dom(E) x dom(I) x dom(L)) x S, where P(X) denotes the power set of X,
• the initial control state s0 and
• the vector l0 with the initial values of L.

Each machine has a special local variable l1 called stime indicating the time elapsed since the machine entered the current control state. It is special because each time the control state is switched, the variable is reset automatically. Every variable v in I, O or L has a finite domain dom(v).

Figure 2: Determining the inputs that do not, possibly and definitely discriminate between R1 and R2 (vertical axis: vcause; horizontal axis: vobs\cause; regions: not discriminable (NTI), possibly discriminable (PTI), definitely discriminable (DTI)).


With the inputs (i1, …, in) and the events (e1, …, en) the machine produces the outputs (o1, …, on) according to the following operating sequence:

1. Set t = 0.
2. Evaluate the state expression ai of the current state st=si to calculate the new values of the output and local variables: (it+1, lt, ot+1, lt+1) ∈ δa,i.
3. If T contains a transition Ti=(ssrc, IF, sdst) with ssrc=st and (et+1, it+1, lt+1) ∈ IF then set st+1=sdst, otherwise set st+1=st.
4. If st+1 ≠ st then reset stime.
5. Set t=t+1.
6. Jump to Step 2.
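The operating sequence can be sketched as a small interpreter in Python. The encoding is an assumption made for illustration: state expressions are functions from (inputs, locals) to (outputs, new locals), transition conditions IF are predicates, and the three-state machine (driving, standing, refilled) is a hypothetical stand-in loosely inspired by the refueling example, not the machine of Figure 3.

```python
def run(fsm, s0, l0, steps):
    """Execute the operating sequence on a list of (event, inputs) pairs."""
    state, local = s0, dict(l0)                         # step 1
    outputs = []
    for event, inputs in steps:                         # steps 5 and 6: iterate over t
        out, local = fsm["expr"][state](inputs, local)  # step 2
        nxt = state
        for src, guard, dst in fsm["trans"]:            # step 3
            if src == state and guard(event, inputs, local):
                nxt = dst
                break
        if nxt != state:                                # step 4: reset stime
            local["stime"] = 0
        state = nxt
        outputs.append(out)
    return state, outputs

def timed(rfd):
    """State expression: add dtime to stime and emit a constant RFD flag."""
    def expr(inputs, local):
        return {"RFD": rfd}, dict(local, stime=local["stime"] + inputs["dtime"])
    return expr

fsm = {
    "expr": {"driving": timed(False), "standing": timed(False),
             "refilled": timed(True)},
    "trans": [
        ("driving", lambda e, i, l: e == "car stops", "standing"),
        ("standing", lambda e, i, l: e == "car starts moving", "driving"),
        ("standing", lambda e, i, l: e == "increased tank filling"
                                     and l["stime"] > 8, "refilled"),
        ("refilled", lambda e, i, l: e == "car starts moving", "driving"),
    ],
}

steps = [
    ("car stops", {"car_moves": False, "dtime": 1}),
    ("nothing", {"car_moves": False, "dtime": 5}),
    ("increased tank filling", {"car_moves": False, "dtime": 5}),
    ("nothing", {"car_moves": False, "dtime": 1}),
]
final_state, outputs = run(fsm, "driving", {"stime": 0}, steps)
print(final_state, [o["RFD"] for o in outputs])
```

In this hypothetical run, the flag RFD becomes true only in the step after the machine has entered the state refilled, because outputs are computed from the state expression of the current state before the transition is taken.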

In our example, the FSM has two input variables, car moves and ∆time, one output variable, RFD, stime as the only local variable, and the events nothing, car starts moving, car stops and increased tank filling. The variable ∆time is set according to the time elapsed since the previous event occurred. Its value is always added to the stime variable, which can be used in the precondition of a transition.

Depending on the chosen set of input variables I and events E, the test generation system needs more information in order to produce meaningful tests, because the values of some variables might depend on the occurrence of an event. E.g. if car movest=true then the event car starts moving cannot occur next. In our example, the following rules are necessary:

car movest = false ∧ car movest+1 = true ⇔ et = car starts moving
car movest = true ∧ car movest+1 = false ⇔ et = car stops
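These two equivalences can be read as a filter on the combinations of variable values and events that a test generator may consider. A minimal Python sketch, assuming Boolean values for car moves and the four events of the example (the function name `consistent` is hypothetical):

```python
from itertools import product

EVENTS = ("nothing", "car starts moving", "car stops", "increased tank filling")

def consistent(moves_t, moves_t1, event):
    """True iff (car_moves_t, car_moves_t+1, e_t) satisfies both rules."""
    starts = (not moves_t) and moves_t1
    stops = moves_t and (not moves_t1)
    return (starts == (event == "car starts moving")
            and stops == (event == "car stops"))

# Enumerate all admissible combinations.
admissible = [(m0, m1, e)
              for m0, m1, e in product((False, True), (False, True), EVENTS)
              if consistent(m0, m1, e)]
print(len(admissible))
```

Of the 16 combinations, only 6 satisfy both rules; for example, car starts moving is rejected whenever the car is already moving.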

In the next section, we describe how the FSM is transformed into a relational representation.

4 Transformation of a FSM into a Compositional, Relational Representation

The conversion of an FSM of the described type produces a compositional model, i.e. a model that preserves the structure and the elements of the FSM. As a consequence, a modification of one part of the FSM results in the modification of only one part of the compositional model (as it will turn out, this is not fully accomplished, for fundamental reasons). The compositional model also provides the possibility of relating “defects” to the various elements (and also of recording and tracing their effects, e.g. in diagnosis).

The basic step is the transformation of the entire FSM into a component C1Step and its internal structure (Figure 4). C1Step takes the state s_t, the values of the local variables l_t, the input vector i_{t+1}, and the event e_{t+1} and generates the subsequent state s_{t+1}, the new values of the local variables l_{t+1}, and the output vector o_{t+1}, reproducing the calculations of the FSM in one step (one iteration of the operating sequence listed above). C1Step consists of the two components CState and CTrans. The former encodes the state expressions δ_{a,i}, while CTrans represents the transitions Ti.

CState constrains s_t, l_t, l_{t+1}, o_{t+1} and i_{t+1} independently from the next event e_{t+1}. It contains nS atomic components Cai, one for each state expression ai, which are placed in parallel (Figure 5). The expressions are conditioned by their respective state and, hence, exactly one component Cai defines the proper values of the variables. Hence, a change in one ai results in the modification of only one component, and a maximum of locality is achieved.

Cai determines l_{t+1} and o_{t+1} depending on s_t, l_t and i_{t+1} according to ai. The relational model of such an atomic component, written for the state expression a_j of state s_j, is:

$$R_{Ca_j} = \Bigl\{ (s_t, i_{t+1}, l_t, l_{t+1}, o_{t+1}) \;\Big|\; \bigl( s_t = s_j \land (i_{t+1}, l_t, o_{t+1}, l_{t+1}) \in \delta_{a,j} \bigr) \lor \bigl( s_t \neq s_j \land l_{t+1} \in L \land o_{t+1} \in O \bigr) \Bigr\}$$

CTrans correlates all the variables except the output o_{t+1} and consists of nT parallel atomic components CTi, one for each transition, and a component CTDefault (Figure 6). Exactly one component CTi determines s_{t+1} depending on s_t, l_{t+1}, i_{t+1} and e_{t+1} according to Ti. The relational model of such an atomic component is:

$$R_{CT_i} = \Bigl\{ (s_t, e_{t+1}, i_{t+1}, l_{t+1}, s_{t+1}) \;\Big|\; \bigl( T_i = (s_t, IF, s_{t+1}) \land (e_{t+1}, i_{t+1}, l_{t+1}) \in IF \bigr) \lor \bigl( \exists \tilde{s}_t, \tilde{s}_{t+1} : T_i = (\tilde{s}_t, IF, \tilde{s}_{t+1}) \land (\tilde{s}_t \neq s_t \lor (e_{t+1}, i_{t+1}, l_{t+1}) \notin IF) \Rightarrow s_{t+1} \in S \bigr) \Bigr\}$$

In all cases where no transition is executed, the atomic component CTDefault defines the values according to the automaton definition: the state does not change, s_{t+1} = s_t. Therefore its relation is

$$R_{CT_{Default}} = \Bigl\{ (s_t, e_{t+1}, i_{t+1}, l_{t+1}, s_{t+1}) \;\Big|\; s_{t+1} = s_t \land \forall\, T_i = (\tilde{s}_t, IF, \tilde{s}_{t+1}) \in T : \tilde{s}_t \neq s_t \lor (e_{t+1}, i_{t+1}, l_{t+1}) \notin IF \Bigr\}$$

Now one iteration of the operating sequence can be simulated. To simulate n iterations, C1Step is copied n times and the copies are placed in series. But this also shows a limitation: the model can simulate only a fixed number of steps, and the more C1Step components are interconnected, the bigger the model (the relation of the entire model) grows.

The number of steps needed for test generation depends on the respective FSM and the failure. In order to discriminate the ok-model from the failure model, n has to be at least as long as the shortest path on which effects of the fault become observable. One solution to this problem could be to start with a small number of steps and increase it until the system produces some tests.

Figure 3: FSM describing a refilling detection in a personal car

A violation of locality becomes evident when the set of transitions is changed, e.g. by deleting, adding, or modifying one. In such cases, not only does the respective CTi component have to be removed, added, or changed, but the default behavior in CTDefault also has to be updated.
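The relational view of one step, and the dependence of the default component on the whole transition set, can be illustrated with finite relations represented as Python sets of tuples. Everything below is a sketch under stated assumptions: the tiny domains, the two-state toy machine, and all function names are invented for illustration; the automatic reset of stime is omitted; the parallel components are combined by a plain intersection; and, to make that intersection work, the default relation here also admits every tuple for which some transition fires.

```python
from itertools import product

# Hypothetical finite domains: states, events, inputs, local values, outputs.
S, E, I, L, O = ["A", "B"], ["go", "stay"], [0], [0, 1], ["lo", "hi"]

# State expressions delta_{a,j}: sets of (i, l, o, l_next) tuples.
delta = {
    "A": {(i, l, "lo", min(l + 1, 1)) for i in I for l in L},
    "B": {(i, l, "hi", min(l + 1, 1)) for i in I for l in L},
}
# Transitions (src, IF, dst) with IF a set of (e, i, l_next) tuples.
T = [
    ("A", {("go", i, l) for i in I for l in L}, "B"),
    ("B", {("go", i, 1) for i in I}, "A"),
]


def r_ca(j):
    """Relation of the state-expression component for state j."""
    return {(s, i, l, ln, o) for s, i, l, ln, o in product(S, I, L, L, O)
            if (s == j and (i, l, o, ln) in delta[j]) or s != j}


def r_ct(transition):
    """Relation of one transition component: constrains s_next only if it fires."""
    src, IF, dst = transition
    return {(s, e, i, ln, sn) for s, e, i, ln, sn in product(S, E, I, L, S)
            if (s == src and (e, i, ln) in IF and sn == dst)
            or s != src or (e, i, ln) not in IF}


def r_ct_default(transitions):
    """Default component: needs ALL transitions, hence the loss of locality."""
    def fires(s, e, i, ln):
        return any(s == src and (e, i, ln) in IF for src, IF, _ in transitions)
    return {(s, e, i, ln, sn) for s, e, i, ln, sn in product(S, E, I, L, S)
            if fires(s, e, i, ln) or sn == s}


def one_step():
    """Join of all components: tuples (s, l, e, i, s_next, l_next, o)."""
    state_rel = set.intersection(*(r_ca(j) for j in S))
    trans_rel = set.intersection(r_ct_default(T), *(r_ct(tr) for tr in T))
    return {(s, l, e, i, sn, ln, o)
            for (s, i, l, ln, o) in state_rel
            for (s2, e, i2, ln2, sn) in trans_rel
            if (s, i, ln) == (s2, i2, ln2)}


def simulate(s, l, stimuli):
    """Chain the one-step relation, one copy per stimulus (e, i)."""
    step, outputs = one_step(), []
    for e, i in stimuli:
        matches = {(sn, ln, o) for (s0, l0, e0, i0, sn, ln, o) in step
                   if (s0, l0, e0, i0) == (s, l, e, i)}
        (s, l, o), = matches          # the toy machine is deterministic
        outputs.append(o)
    return s, outputs
```

Changing or deleting an entry of `T` alters exactly one `r_ct` relation but also the result of `r_ct_default`, which is the violation of locality described above.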

5 Fault Models

As described in section 2, our approach to testing is based on trying to confirm the correct behavior by refuting the models of possible faulty behaviors. When testing systems that are composed of physical components only, these models are obtained in a natural way from the fault models of elementary components, which usually have a small set of (qualitatively different) foreseeable misbehaviors due to the underlying physics. Faults due to additional interactions among components are either neglected or have to be anticipated and manifested in the model. In summary, for physical systems, the specific realization of the system determines the possible kinds of misbehavior, and testing compares them to a situation where all components work properly.

In software testing, this does not apply. First, the space of possible faults is not restricted by physical laws, but only by the creativity of the software developer when making mistakes. This space is infinite, and the occurrence of structural faults is the rule rather than an exception. Second, the assumption that correct functioning of all (software) components assures the achievement of the intended overall behavior does not hold. This marks an important difference between testing physical artifacts and software. For the former, we can usually assume it was designed correctly (which is why correct components together will perform correctly), but for the software we cannot. It is just the opposite: testing aims at revealing design faults.

In our application, the situation is complicated by the fact that it starts from the functional requirements rather than a detailed software design or even the code which might suggest certain types of bugs to check for (e.g. no termination of a while loop). On the positive side, this may lead to a smaller, qualitatively characterized set of possible misbehaviors.

In our example about the detection of fuel refilling, a failure one might think of is that the software does not poll the car’s movement during driving and therefore does not detect a stop. This means the machine stays in its current control state instead of performing T3. The transition T3 could be seen as deleted. Such a failure model can be constructed by applying the following operator to the ok-model:

remove-if-condition: (ma, Ti) → (ma’)
where ma’ = ma[IFi → ∅] and Ti = (s_t, IFi, s_{t+1}).

The operation ma[A → B] results in an FSM ma’ which is equal to ma except that element A is substituted with B.

Another faulty behavior would occur if the software treated an increased tank level after 8 s in standstill exactly as if the car started moving. With respect to the FSM, this means executing T6 instead of T7. The corresponding failure model can be constructed by the operator

move-if-condition-to: (ma, Ti, Tj) → (ma’)
where ma’ = ma[IFi → IFi ∪ IFj, IFj → ∅] and Ti = (s_t, IFi, s_{t+1}).
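The two operators can be sketched as pure functions on a small FSM data structure. The sketch assumes that a machine is stored as a dictionary from transition names to `Transition` records and that an IF condition is a finite set of (event, input, local) tuples; these representation choices, the Python names, and the sample transitions are assumptions made for illustration only.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Transition:
    src: str
    condition: FrozenSet[tuple]   # IF: set of (event, input, local) tuples
    dst: str


def remove_if_condition(fsm: Dict[str, Transition], ti: str) -> Dict[str, Transition]:
    """ma[IFi -> {}]: transition ti can never fire in the mutated machine."""
    mutated = dict(fsm)
    mutated[ti] = replace(fsm[ti], condition=frozenset())
    return mutated


def move_if_condition_to(fsm: Dict[str, Transition], ti: str, tj: str) -> Dict[str, Transition]:
    """ma[IFi -> IFi | IFj, IFj -> {}]: ti also fires whenever tj would have fired."""
    mutated = dict(fsm)
    mutated[ti] = replace(fsm[ti], condition=fsm[ti].condition | fsm[tj].condition)
    mutated[tj] = replace(fsm[tj], condition=frozenset())
    return mutated


# Hypothetical ok-model fragment with two transitions leaving the same state.
m_ok = {
    "T6": Transition("S4", frozenset({("car starts moving", True, 8)}), "S2"),
    "T7": Transition("S4", frozenset({("increased tank filling", False, 8)}), "S6"),
}
m_del = remove_if_condition(m_ok, "T7")
m_moved = move_if_condition_to(m_ok, "T6", "T7")
```

Both functions return a new dictionary and leave the ok-model untouched, so several failure models can be derived from the same `m_ok`.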

6 Results

In this section the discrimination of two failure models from the ok model is discussed. These failures are:

• mdelT3 = remove-if-condition(mok, T3)
• mdelT5 = remove-if-condition(mok, T5)

A relational model that simulates 7 steps of the FSM is used here.

Discrimination between mok and mdelT3

Only two types of tests are generated to discriminate these two models. Figure 7 lists them, where ‘*’ stands for any value in the domain of the respective variable. The trajectories caused by the test inputs are shown in Figure 9. The input sequence of the first test could be formulated more naturally as follows:

1. Starting from the initial state, one waits for 4 s,
2. then the car starts moving, and
3. directly after this, it stops again, and
4. one waits again for 4 s.
5. After waiting a third time for 4 s,
6. a significant increase of the tank filling is detected.

Figure 4: C1Step and its internal structure (block diagram: CState and CTrans with inputs s_t, l_t, i_{t+1}, e_{t+1} and outputs s_{t+1}, l_{t+1}, o_{t+1})

Figure 5: CState and its internal structure (block diagram: parallel components Ca1, Ca2, ..., Can, each with inputs (l_t, s_t) and i_{t+1} and outputs l_{t+1} and o_{t+1})

Figure 6: CTrans and its internal structure (block diagram: parallel components CT1, CT2, ..., CTn and CTDefault, each with inputs (l_{t+1}, s_t), e_{t+1} and i_{t+1} and output s_{t+1})

Discrimination between mok and mdelT5

To discriminate these two models, 36 different types of tests are generated. The two tests of the previous discriminations are also among them. Some of them are shown in Figure 8. In test 2, the second and the third event occurring are “increased tank filling”. These events are unnecessary. Without these two steps the test input still discriminates the fault from the ok model. The reason that the system generates these is the fixed number of steps of the relational model. So some steps have to be filled with events having no effects but serving as placeholders. This explains why so many different tests are generated.

Eliminating unnecessary stimuli is addressed in [Struss05].

Discrimination of both pairs

The tests discriminating both pairs (mok from mdelT3 as well as from mdelT5) are the two from the first discrimination, because these are also in the generated set of the second one. In our example this is not surprising. To distinguish between an ok automaton and a fault automaton in which a transition is deleted, one of the two always has to reach state S6, because only there does the output differ from that of the other states.

7 Related Work

Comparison to classical test generation approaches

Classical approaches generate tests optimized with respect to certain coverage criteria like state, transition, or MCDC coverage [Beizer95]. In our approach, carefully chosen sets of failure models yield tests that also achieve classical coverage criteria. To obtain state coverage, for example, a set of failure models Mfail could be constructed as follows. For each state si, there exists one failure model mfail,i in Mfail which differs from the ok-model in the output of state expression ai only. The outputs of these two models are complementary. For the case that mok is a deterministic automaton, the equivalence is proven in [Esser05].

Comparison with the diagnoser approach

The diagnoser of [Sampath 96] could also be used for test generation (although the authors are not aware of any publication describing this). In this approach a diagnoser is generated from the system model, both being FSMs, for calculating diagnoses and diagnosability. The transitions of a diagnoser are labeled with observable events, whereas the states are labeled with the behavior modes consistent with the events that have occurred so far.

For test generation, the set of observable events could be split into causal and non-causal observable events. Then, the task is to find a causal event sequence where each diagnoser path consistent with these causal events and any possible non-causal observables has either only an ok-label or no ok-label at all. Each such sequence of causal and observable events is a valid test for the modeled failures.

Figure 7: tests discriminating mok from mdelT3

Figure 8: tests discriminating mok from mdelT5

We expect that the two approaches can be transformed into each other. The kind of automata used in the diagnoser approach differs in two respects:

• They do not include input, output, or local variables. Instead, observable, unobservable, and failure events are used.

• Instead of separate failure models, only one model is used, which includes all relevant behavior modes.

A mapping from an FSM A as described in this paper to a state machine A' of the system model for the diagnoser approach could be outlined as follows:

• Each possible value combination of input and output variables as well as of the event in A is represented by its own observable event E' in A':
fE: (I, O, E) → E'

• Each possible combination of state s and values of the local variables L in A is mapped to its own state in A'. This may result in a very large state set.
fS: (S, L) → S'

• A transition T' with an observable event e' between two states s1', s2' in A' exists iff there exists a proper transition T and state expression X in A:
T' = (s1', e', s2') exists iff (i, o, e) = fE^-1(e') ∧ (s1, l1) = fS^-1(s1') ∧ (s2, l2) = fS^-1(s2') ∧ (i, l1, o, l2) ∈ δA(s1) ∧ Ti = (s1, IF, s2) ∈ T(A) with (e, i, l2) ∈ IF

• Finally, the resulting state machines A' for the different behavior modes have to be merged into a single one by introducing transitions which use failure events and which specify the differences between the individual machines.

An analysis and comparison of the efficiency would be interesting. However our main goal here was to find a formalism to use the same relational approach on software as we use already for test generation of physical devices.

8 Discussion

The problem which is central to our approach is finding appropriate fault models representing realistic and relevant faults. On the one hand, they are difficult to obtain for software and even more so, when one starts from a functional specification, as we do. This may seem to be a disadvantage in comparison with the other testing heuristics, like coverage criteria. However, it is not true that they do not involve fault models. In fact, they are based on assumptions about possible faults, but these are implicit. The fact that our approach makes them explicit is a major advantage and the basis for more progress. It also bears the potential to generate tests whose power and coverage grows together with the refinement of the specification during the development process.

We consider the results of this experiment as encouraging and will continue this work in a project with Audi AG. It has raised a number of issues that need to be addressed in this project.

A basic one concerns the question whether the current modeling formalism, a specific type of finite state machine, is appropriate. This has several aspects: First, it has to be checked whether it is expressive enough to capture the requirements on embedded software. Second, the impact of the representation on the complexity of the algorithm has to be analyzed (handling absolute time is an important issue, as stated below). These aspects have to be confronted with the most important guideline: appropriateness for current practice.

Our project is not an academic exercise, but aims at tools that can be easily used in the actual work process. Current requirement specifications at the development stage that matters in our context comprise mainly natural language text together with a few formal or semi-formal elements, such as state charts (provided they are written at all!). Assuming the existence of formal, executable specifications is unrealistic. Any formal representation of the requirements as we need them as an input to our tools needs to take into account whether they can be produced in the current process, by the staff given its education and background, and the limited efforts that can be spent in a real project where meeting deadlines and reducing development time has top priority. Whenever the use of new tools and additional work is required, this needs a rigorous justification by a significant pay-off (in our case in the time spent on testing and the quality of its results).

Figure 9: trajectory of the FSM for the two tests of Figure 7


On the technical side, an adequate handling of time is needed. In our example, time elapsing in a particular state (e.g. “8s with no motion”) seems to be local. However, the respective event has to be stated in a way that can be interpreted properly in other states as well, which may have been reached due to a fault. Introducing global absolute time tends to enforce using the smallest time increments required for some state and event, which appears prohibitive.

Acknowledgements

Thanks to Torsten Strobel who implemented the algorithm, Oskar Dressler for discussions and support of this work, the Model-based Systems and Qualitative Modeling Group at the Technical University of Munich and the reviewers for their helpful comments. We also thank Audi AG, Ingolstadt, and, in particular, Reinhard Schieber for support of this work.

References

[Esser 05] Esser, M.: Modellbasierte Generierung von Tests für eingebettete Systeme am Beispiel der Tankanzeige in einem Kraftwagen. Technical University of Munich, 2005.

[Beizer95] Beizer, B.: Black-Box Testing. John Wiley and Sons, New York, NY, 1995.

[McIlraith-Reiter 92] McIlraith, S., Reiter, R.: On Tests for Hypothetical Reasoning. In: W. Hamscher, J. de Kleer and L. Console (eds.), Readings in Model-based Diagnosis: Diagnosis of Designed Artifacts Based on Descriptions of their Structure and Function. Morgan Kaufmann, San Mateo, 1992.

[OCC’M 05] www.occm.de

[Sampath 96] Sampath, M., Sengupta, R., Lafortune, S., Sinnamohideen, K., Teneketzis, D.: Failure Diagnosis using Discrete Event Models. In: IEEE Transactions on Control Systems Technology, 4(2), 1996, pp. 105-124.

[Struss 94] Struss, P.: Testing Physical Systems. In: Proceedings of AAAI-94, Seattle, USA, 1994.

[Struss 94a] Struss, P.: Testing for Discrimination of Diagnoses. In: Working Papers of the 5th International Workshop on Principles of Diagnosis (DX-94), New Paltz, USA, 1994.

[Struss 05] Struss, P.: Automated Test Reduction. In: B. Rinner et al. (eds.), 19th International Workshop on Qualitative Reasoning QR-05, May 18th - 20th, 2005, Graz, Austria, pp. 117-122. Also in: Dearden, R. and Narasimhan, S. (eds.), 16th International Workshop on Principles of Diagnosis (DX-05), June 1-3, 2005, Monterey, California, pp. 187-192.

[Vatcheva-de Jong-Mars 02] Vatcheva, I., de Jong, H., Mars, N.: Selection of Perturbation Experiments for Model Discrimination. Proceedings of ECAI-02, 2002.


A Multi-Valued SAT-Based Algorithm for Faster Model-Based Diagnosis

Alexander Feldman, Jurryt Pietersma and Arjan van Gemund

Delft University of Technology
Faculty of Electrical Engineering, Mathematics and Computer Science
Mekelweg 4, 2628 CD, Delft, The Netherlands
Tel.: +31 15 2781935, Fax: +31 15 2786632

e-mail: a.b.feldman,j.pietersma,[email protected]

Abstract

Finite integer domains offer an intuitive representation of fault diagnosis models of real-world systems. Approaches that encode multi-valued models to the Boolean domain suffer from combinatorial explosion. Prompted by recent advances in multi-valued SAT solving, in this paper we present a multi-valued diagnosis algorithm. This sound and complete algorithm is based on multi-valued SAT and A*, and does not require Boolean encoding. The resulting diagnostic engine is specifically designed to suit the characteristics of the diagnosis search and better exploits the locality which is present in the multi-valued variable domains of a wide range of model-based diagnosis problems. Results from experiments on both synthetic and real-world problems are in agreement with recently reported good performance of multi-valued DPLL consistency checkers. Models used for experimentation include NASA’s X-34 propulsion system and ASML’s wafer scanner subsystems. The empirical results show that, depending on the domain size and number of variables, the multi-valued approach can deliver up to two orders of magnitude speedup over Boolean approaches.

1 Introduction

When considering a multi-valued model of the X-34 propulsion system [Sgarlata and Winters, 1997], one option is to model it in the Boolean domain and to use a SAT-based diagnosis algorithm or, alternatively, to create a model in which each variable is in the finite-integer domain. The latter approach allows for intuitive modeling of components with multiple fault modes and discretization of continuous values in areas like qualitative reasoning. Within this modeling technique, we face the choice of directly using a multi-valued diagnostic algorithm or, as an alternative, encoding the model in the Boolean domain by using an appropriate mapping.

Boolean encoding is not without a price. Many diagnostic (and SAT) algorithms work on a normalized representation of a model (e.g., CNF). In the Boolean encoding phase, the model loses important locality information (treating the different values of a variable in connection to each other increases the reasoning speed). This problem of “breaking apart” the multi-valued variables is often aggravated later in the normalization, as the Boolean encodings of a single multi-valued variable can be “spread through” a model after it has been encoded. Directly using a model represented in multi-valued normal form preserves the locality, while a multi-valued diagnostic algorithm retains the simplicity of a Boolean diagnosis algorithm.

The main algorithm introduced in this paper is a multi-valued A* search which computes the diagnoses in best-first order, starting with a minimal diagnosis (note that the diagnoses produced subsequently are not necessarily minimal). Thus, not all minimal diagnoses need be computed (the number of all minimal diagnoses can still be exponential in the number of components [Vatan, 2002]).

The method proposed in this paper is not the only one which achieves speed-up by exploiting locality. A complementary approach is to exploit system structure and hierarchy [Feldman et al., 2005; Fattah and Dechter, 1995]. Reasoning in representations which are closer to the “raw” model reduces the number of pre-processing transformation steps and, in general, achieves faster diagnosis times. All these techniques can be combined to achieve real-time diagnosis for a wide class of real-world applications.

Surprisingly, there are few publications [Bandelj et al., 2002] concerning the application of non-Boolean search algorithms to qualitative reasoning and model-based diagnosis. In the satisfiability field, CAMA [Liu et al., 2003] is a novel extension of the classic DPLL algorithm. The CAMA algorithm uses unit-clause propagation and conflict learning to increase the performance of the satisfiability checking. Similarly, [Frisch and Peugniez, 2001] studies direct non-Boolean stochastic local search, where a well-known Boolean SAT engine (Walksat) is modified to handle non-Boolean problems. The authors of both CAMA and NB-Walksat find their approaches faster for multi-valued SAT instances with increasing domain size, compared to a Boolean DPLL run on the equivalent Boolean encodings.

Encoding multi-valued problems in the Boolean domain is a technique discussed in many papers. The study of the use of both a Boolean truth maintenance system and the finite domain approach for solving a class of constraint satisfaction problems (CSP) dates back to [de Kleer, 1989]. The latter paper suggests sparse Boolean encoding for finite-domain integer solvers. Multi-valued reasoning is complementary to other locality-based diagnosis search techniques [Provan, 2001]. A study on the use of multi-valued propositional encodings for finite-domain variables and the possibility of combinatorial loss of performance due to the increase of the propositional theory size, however, is beyond the scope of that paper.

CSP-based algorithms for model-based diagnosis [Sachenbacher and Williams, 2004; Williams and Ragno, 2004] already consider multi-valued variable domains. An extensive study has been performed on CSP decomposition techniques [Stumptner and Wotawa, 2003], and some empirical results discussed by the same authors in [Stumptner and Wotawa, 2001] show good performance. The latter decomposition technique results in faster diagnosis time due to tractability achieved by transforming the original problem to an equivalent one with restricted structure. Our technique differs from the above in that it is based on multi-valued propositional search, hence allowing more aggressive optimization by borrowing learning algorithms and heuristics from the satisfiability domain.

The method discussed in this paper can be generalized to almost any approach for propositional reasoning. Multiple-valued decision diagrams (MDD) [Srinivasan et al., 1990] are a natural extension to binary decision diagrams (BDD). To our knowledge the state-of-the-art compilation technique of Darwiche [Darwiche, 2004] has not been generalized to multi-valued propositional logic. Hence an approach similar to ours would be complementary to this method. Similarly, a propositional truth-maintenance system (TMS) (e.g., [Nayak and Williams, 1998]) and Boolean model-based reasoning systems [Frohlich and Nejdl, 1997] can benefit from being able to reason over non-Boolean literals.

The type of Boolean encoding influences the performance of the DPLL search, and the topic is further studied in [Hoos, 1999]. This paper distinguishes between compact and sparse encodings (also referred to as unary and binary in other papers) and compares the performance of a state-of-the-art CSP solver and a Boolean SAT consistency checker. The results show increased performance of the multi-valued approach.

A hybrid finite-domain constraint solver for circuits is suggested in [Parthasarathy et al., 2004]. This approach combines a Boolean DPLL checker and a finite-domain integer CSP solver to allow for more compact problem representation, avoiding the house-keeping constraints imposed by unary or binary encodings (where the number of variables is not a power of 2). This results in a faster solver which has wide application, including consistency-based fault diagnosis.

The algorithm described in this paper provides a sound and complete method for computing diagnoses in best-first order. Heuristic functions working on multi-valued representations are provided with additional locality information in comparison to their counterparts working on encoded Boolean models. The extra search dimension which is added by the multi-valued domain of the model variables facilitates faster computation of leading diagnoses. In particular, our method is compared to Boolean algorithms working on both sparse and dense encodings. Sparse encoding leads to combinatorial explosion even with small models, while dense encoding, still slower than the multi-valued approach, imposes difficulties on constructing efficient heuristic functions. All this motivates the use of the direct multi-valued reasoning described below.

The remainder of the paper is organized as follows. In Section 2 we introduce some basic terminology and show a multi-valued consistency checking algorithm, which will be used later in the diagnosis process. In Section 3 we describe the main diagnostic algorithm and illustrate its workings with a small example. Section 4 contains experimental performance results. Finally, conclusions, notes, and future work are presented.

2 Multi-Valued Satisfiability

The technique presented in this paper searches for a diagnosis by checking for consistency of a possible health assignment, the system description, and the observation, while discarding the states which are inconsistent. The consistency check is done using a DPLL-based algorithm in the multi-valued domain. The main difference between the multi-valued DPLL and its well-known Boolean counterpart comes from the fact that a multi-valued consistency checking routine can branch on more than two values. Before we show this multi-valued variant of DPLL and discuss the opportunities for its optimization, we introduce the basic terminology necessary for multi-valued satisfiability.

Definition 1 (Multi-Valued Literal). A multi-valued variable v_i ∈ V takes a value from a finite domain, which is an integer set D_i = {1, 2, . . . , m}. A positive multi-valued literal l_j^+ is a Boolean function l_j^+ ≡ (v_i = d_k), where v_i ∈ V, d_k ∈ D_i. A negative multi-valued literal l_j^− is a Boolean function l_j^− ≡ (v_i ≠ d_k), where v_i ∈ V, d_k ∈ D_i. If not specified, a literal l_j can be either positive or negative.

Note that a variable v_i can assume at most one value or its complement in D_i. The need for negative literals comes from the fact that, frequently in models, the process of converting to multi-valued CNF results in many negations. For a variable with a sufficiently large domain, using this notation instead of all complementary values leads to significantly shorter formulae. Next we introduce a multi-valued conjunctive normal form (CNF), which will be our representation for the diagnostic problem¹.

Definition 2 (Multi-Valued CNF). A multi-valued conjunctive normal form is a conjunction of disjunctions of multi-valued literals, that is C = σ_1 ∧ σ_2 ∧ . . . ∧ σ_n and σ_i = l_{i1} ∨ l_{i2} ∨ . . . ∨ l_{ik} for i = 1 . . . n.

Throughout this text we will also use an alternative notation for a formula in CNF: a clausal set. In this case the clausal set is a set of clauses and each clause is a disjunction of multi-valued literals.

Definition 3 (Multi-Valued Assignment). A multi-valued assignment φ or a multi-valued term is a conjunction of multi-valued literals, that is φ = l_1 ∧ l_2 ∧ . . . ∧ l_p.

¹ Translation from multi-valued propositional logic to multi-valued CNF is well-studied and we will omit it for brevity.


Obviously, we can convert a multi-valued assignment to a clausal set in which each clause contains a single literal, and vice versa, by using De Morgan’s law: ¬(l_1 ∧ l_2 ∧ . . . ∧ l_p) = ¬l_1 ∨ ¬l_2 ∨ . . . ∨ ¬l_p. The latter proves useful when, later in Algorithm 2, we have to add an observation represented as an assignment to the clausal set of the system description.

If an assignment contains all the variables in a multi-valued CNF we will call it full, otherwise it is partial. Next we describe an algorithm which, given a multi-valued CNF C, returns True iff there is an assignment (a conjunction of literals) φ such that C |= φ.

The worst-case time complexity of Algorithm 1 is exponential in the number of variables in φ. Formulae from model-based diagnosis, however, are highly structured and rarely expose this worst-case performance. If we consider all the domain sets D_i ∈ D, the cardinality of the largest one is d_max = max_{D_i ∈ D} |D_i|, that is, d_max is the number of values in the largest variable domain in φ. The space complexity of Algorithm 1 is then O(|V| × d_max), where |V| is the number of variables in φ.

Algorithm 1 Multi-valued DPLL consistency checking.

function MVSAT(C, V, D, φ)
    inputs: C, a clausal set
            V = {v_1, v_2, . . . , v_q}, a set of variables
            D = {D_1, D_2, . . . , D_q}, a set of domain sets
            φ, an assignment, initially empty

    if (φ ← φ ∧ UNITPROPAGATE(C, φ)) |= ⊥ then
        return False
    end if
    if ∀σ ∈ C : φ |= σ then
        return True
    end if
    for all v_i ∈ V, k ∈ D_i : l ≡ (v_i = k), l ∉ φ do
        if MVSAT(C, V, D, φ ∧ l) = True then
            return True
        end if
    end for
    return False
end function

The workings of Algorithm 1 are very similar to those of the original Boolean algorithm, except that it branches for every possible value of the selected variable v_i. An important part of the algorithm is the unit propagation, implemented in the UNITPROPAGATE routine. Note that, unlike in the Boolean case, unit propagation works only for clauses which are unit-open with a free positive literal (in the Boolean case we can assign a value to negative literals as well).
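A compact Python sketch of such a multi-valued DPLL procedure is shown below. It is not the authors' implementation: the encoding of a literal as a `(variable, value, positive)` triple, the function names, and the toy formula are assumptions made for illustration. The sketch also departs from the pseudocode in two ways that are common in DPLL implementations: it branches on one unassigned variable at a time, and it returns a satisfying (possibly partial) assignment instead of True, or None instead of False.

```python
def lit_value(lit, assignment):
    """True/False if the literal is decided by the assignment, None if still free."""
    var, val, positive = lit
    if var not in assignment:
        return None
    return (assignment[var] == val) == positive


def unit_propagate(clauses, assignment):
    """Extend the assignment by unit propagation; return None on a conflict."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            values = [lit_value(lit, assignment) for lit in clause]
            if True in values:
                continue                      # clause already satisfied
            free = [lit for lit, v in zip(clause, values) if v is None]
            if not free:
                return None                   # every literal is false
            # Only a free POSITIVE literal fixes a value in the multi-valued case.
            if len(free) == 1 and free[0][2]:
                var, val, _ = free[0]
                assignment[var] = val
                changed = True
    return assignment


def mvsat(clauses, domains, assignment=None):
    """Return a satisfying assignment (dict) or None if the clauses are inconsistent."""
    assignment = unit_propagate(clauses, assignment or {})
    if assignment is None:
        return None
    if all(any(lit_value(lit, assignment) for lit in clause) for clause in clauses):
        return assignment
    for var, domain in domains.items():
        if var in assignment:
            continue
        for val in domain:                    # branch on every value of the variable
            result = mvsat(clauses, domains, {**assignment, var: val})
            if result is not None:
                return result
        return None                           # no value of this variable works
    return None


# Toy formula: (a = 1 or b = 2) and (a != 1) and (b != 2 or a = 3)
domains = {"a": [1, 2, 3], "b": [1, 2]}
clauses = [
    [("a", 1, True), ("b", 2, True)],
    [("a", 1, False)],
    [("b", 2, False), ("a", 3, True)],
]
model = mvsat(clauses, domains)
```

For the toy formula, the branches a = 1 and a = 2 both end in a conflict during propagation, and the branch a = 3 propagates b = 2.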

Algorithm 1 is subject to the same optimization techniques as the Boolean DPLL. An important source of speed-up can be conflict learning². In addition to the classical Boolean speed-up methods, an additional source of speed-up would be the use of various heuristic techniques which are possible due to the extra search dimension caused by the multiple values. When choosing a value for a given variable, for example, it is possible to select (either dynamically for the remaining values and clauses or statically) the one which will satisfy the largest number of clauses.

² Actually, in the cases of inconsistent input we need a conflict extraction mechanism, the result of which is to be used for pruning the diagnostic tree in Algorithm 2. Due to the limited scope of this paper we will not discuss mechanisms for conflict extraction.

Algorithm 1 has the potential to determine satisfiability faster than a Boolean DPLL running on an encoding. Consider the example formula C = ((x = 1) ∨ (y = 2)) ∧ ((x = 3) ∨ (y ≠ 1)), with the domains of the variables x and y being Dx = {1, 2, 3} and Dy = {1, 2}, respectively. A sparse Boolean encoding would map x = 1 to x1, x = 2 to x2, x = 3 to x3, y = 1 to y1 and, finally, y = 2 to y2. In addition, it needs to impose the constraints x1 ≠ x2, x1 ≠ x3, x2 ≠ x3 and y1 ≠ y2. The sparsely encoded Boolean formula is Cs = (x1 ∨ y2) ∧ (x3 ∨ ¬y1) ∧ (x1 ∨ x2 ∨ x3) ∧ (¬x1 ∨ ¬x2) ∧ (¬x1 ∨ ¬x3) ∧ (¬x2 ∨ ¬x3) ∧ (y1 ∨ y2) ∧ (¬y1 ∨ ¬y2). While the multi-valued formula C has 2 variables with 3 and 2 values, respectively, and a total of 2 clauses, its Boolean encoding Cs has 5 variables and 8 clauses. In the multi-valued case we can determine the satisfiable assignment (x = 2) ∧ (y = 2) in 4 steps only (ignoring the effect of the unit propagation), while in the Boolean encoding we have to perform 6 recursive calls.
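The correspondence between C and its sparse encoding Cs can be checked by brute force. The following is only a sketch: it assumes that the second clause of C contains the negative literal (y ≠ 1), as the literal ¬y1 in Cs suggests, and the function names are ours rather than part of any solver.

```python
from itertools import product

DX, DY = (1, 2, 3), (1, 2)  # domains of x and y

def c_multi(x, y):
    # Multi-valued formula C; (y != 1) is our reading of the second clause.
    return (x == 1 or y == 2) and (x == 3 or y != 1)

def c_sparse(x1, x2, x3, y1, y2):
    # The eight clauses of the sparse Boolean encoding Cs.
    return ((x1 or y2) and (x3 or not y1)
            and (x1 or x2 or x3)
            and (not x1 or not x2) and (not x1 or not x3) and (not x2 or not x3)
            and (y1 or y2) and (not y1 or not y2))

multi_models = {(x, y) for x, y in product(DX, DY) if c_multi(x, y)}

boolean_models = [b for b in product((False, True), repeat=5) if c_sparse(*b)]
# Decode each Boolean model back to a pair (x, y) of domain values.
decoded_models = {(b[:3].index(True) + 1, b[3:].index(True) + 1)
                  for b in boolean_models}

print(sorted(multi_models), len(boolean_models))
```

Enumerating 6 multi-valued assignments against 32 Boolean ones illustrates, on a very small scale, the growth of the search space that the encoding introduces.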

3 Diagnosis of Multi-Valued Models

Modeling physical artifacts and representing them from first principles is a topic of its own. In this paper we assume that a correct model has been converted to a multi-valued CNF, and that what remains is the computationally intensive task of finding all possible explanations of a given observation. In order to suggest such an algorithm we first formalize the notions of a diagnostic problem and a diagnosis.

Definition 4 (Diagnostic Problem). A multi-valued diagnostic problem DP is defined as the ordered triple DP = 〈SD, H, OBS〉, where SD = 〈V, D, C〉 is the set of variables, their domains and a system description represented as a multi-valued CNF, H is a set of health variables such that H ⊂ V, and the observation OBS is a variable assignment over some of the variables in V \ H.

If a device is malfunctioning then assigning nominal values (let us name such a nominal assignment x) to all the health variables of its model manifests an inconsistency with the observation, i.e., OBS ∪ SD ∪ x ⊨ ⊥. A sound and complete diagnostic algorithm has to find all assignments which explain the observation OBS, that is, ∀x : x ∈ X, OBS ∪ SD ∪ x ⊭ ⊥.

Definition 5 (Multi-Valued Diagnosis). A diagnosis or partial diagnosis for the system DP = 〈SD, H, OBS〉, SD = 〈V, D, C〉 is the assignment x = l1 ∧ l2 ∧ . . . ∧ ln such that SD ∧ OBS ∧ x ⊭ ⊥.

The central problem of model-based diagnosis is that for n health variables we may have as many as 2^n possible diagnoses. In the multi-valued domain, the complexity is even worse due to the multi-valued domains of the variables, i.e., it becomes O(d^n), where d is the cardinality of the largest domain set. Although the number of diagnoses depends on the model (e.g., whether it is a weak or strong fault model) and the size of the observation, we rarely need all diagnoses. An informed search strategy such as A* is a suitable method for computing only those diagnoses which optimize an objective function g. Such an approach allows us to stop the diagnostic algorithm after we have found the first N diagnoses maximizing g.

As a heuristic function, we typically employ a greedy estimator of the probability of an assignment φ. Let us assign a-priori probabilities to all possible values of all health variables. We define the a-priori probability function of a health variable hi as pi(hi), where 0 ≤ pi(x) ≤ 1 for x ∈ Di and ∑x∈Di pi(x) = 1. The probability estimator of a health assignment φ = l1 ∧ l2 ∧ . . . ∧ lk is defined as g(φ) = p(h1) p(h2) . . . p(hn), where h1, h2, . . . , hn are all health variables hi ∈ H and:

$$
p(h_i) =
\begin{cases}
p_i(l_j) & \text{if } l_j^{+} \in \varphi \\
1 - p_i(l_j) & \text{if } l_j^{-} \in \varphi \\
\max_{k \in D_i} p_i(k) & \text{otherwise}
\end{cases}
\tag{1}
$$
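As an illustration of the estimator in (1), the sketch below computes g for a partial health assignment. The data layout (a dictionary of priors per variable, and a literal given as a value plus a sign) and the name `estimate` are our assumptions, and the unassigned case is taken to be the largest prior probability of the variable.

```python
from math import prod

def estimate(priors, assignment):
    """Greedy probability estimator g for a (partial) health assignment.

    priors:     {variable: {value: a-priori probability}}
    assignment: {variable: (value, positive)}; positive=False encodes the
                negative literal "variable != value".
    """
    factors = []
    for var, dist in priors.items():
        if var in assignment:
            value, positive = assignment[var]
            factors.append(dist[value] if positive else 1.0 - dist[value])
        else:
            # Unassigned variable: optimistic bound, the most likely value.
            factors.append(max(dist.values()))
    return prod(factors)

# Priors of the valve health variable h used later in this section.
valve_priors = {"h": {1: 0.9, 2: 0.045, 3: 0.045, 4: 0.01}}

g_empty = estimate(valve_priors, {})
g_stuck_open = estimate(valve_priors, {"h": (2, True)})
g_not_nominal = estimate(valve_priors, {"h": (1, False)})
```

Because every unassigned variable contributes its largest prior, g never underestimates the probability of any completion of a partial assignment, which is what the A* ordering relies on.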

Next we show the actual multi-valued A* algorithm for model-based diagnosis.

Algorithm 2 Multi-Valued A* Diagnosis.
    procedure MVA*(SD, OBS)
        inputs: SD = 〈V, D, C〉, a system model
                OBS, an observation term
        local variables: Q, a priority queue
                         x, a health assignment

        PUSH(Q, INITIALSTATE())
        while (x ← POP(Q)) ≠ ⊥ do
            if MVSAT(SD ∪ OBS ∪ x, V, D, ∅) then
                if ISFULLDIAGNOSIS(x) then
                    OUTPUTDIAGNOSIS(x)
                    ENQUEUESIBLINGS(Q, SD ∪ OBS, x)
                else
                    PUSH(Q, CHILDSTATE(SD ∪ OBS, x))
                end if
            else
                ENQUEUESIBLINGS(Q, SD ∪ OBS, x)
            end if
        end while
    end procedure

An implementation of (1) is used for the ordering in the priority queue Q. For the manipulation of the nodes in this queue (each node is an ordinary multi-valued health assignment) we use the standard functions PUSH and POP, the latter returning ⊥ when the queue is empty.

The queuing of the nodes is performed by the CHILDSTATE and ENQUEUESIBLINGS routines. The former extends a partial assignment with its most probable descendant and the latter pushes onto the queue the most probable siblings of each ancestor of the current node. Note that, because the search is non-systematic, we have to keep a list of the visited nodes (this can be organized somewhat more efficiently in a trie).
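The overall control flow can be sketched with an ordinary best-first search. This is a simplification and not the algorithm above: instead of CHILDSTATE and ENQUEUESIBLINGS it expands a node systematically with every value of the next health variable, so no visited list is needed. The consistency check is passed in as a hypothetical callback `consistent`, standing in for the multi-valued SAT check on SD ∪ OBS ∪ x.

```python
import heapq
from itertools import count
from math import prod

def best_first_diagnoses(priors, consistent, limit):
    """Return up to `limit` full health assignments, most probable first.

    priors:     {variable: {value: a-priori probability}}
    consistent: callback taking a partial assignment {variable: value} and
                returning True if it is consistent with model and observation.
    """
    variables = list(priors)

    def g(partial):
        # Assigned variables contribute their prior, the rest their maximum.
        return prod(priors[v][partial[v]] if v in partial
                    else max(priors[v].values()) for v in variables)

    tie = count()  # tie-breaker so that dictionaries are never compared
    queue = [(-g({}), next(tie), {})]
    found = []
    while queue and len(found) < limit:
        _, _, partial = heapq.heappop(queue)
        if not consistent(partial):
            continue  # every extension of an inconsistent node is inconsistent
        if len(partial) == len(variables):
            found.append(partial)
            continue
        var = variables[len(partial)]
        for value in priors[var]:
            child = {**partial, var: value}
            heapq.heappush(queue, (-g(child), next(tie), child))
    return found

# Toy problem (hypothetical): two components that cannot both be nominal.
toy_priors = {"h1": {1: 0.9, 2: 0.1}, "h2": {1: 0.8, 2: 0.2}}
toy_check = lambda a: not (a.get("h1") == 1 and a.get("h2") == 1)
leading = best_first_diagnoses(toy_priors, toy_check, limit=2)
```

Stopping after `limit` results corresponds to returning only the first N leading diagnoses.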

As in many cases diagnostic models can be over- or under-constrained (depending on the modeling technique), Algorithm 2 can be extended with a conflict-learning mechanism for pruning parts of the search tree [Williams and Ragno, 2004]. Finding the set of all minimal conflicts is a problem which is itself NP-hard, hence such a technique has no choice but to perform a limited amount of analysis (e.g., through resolution) for finding conflicts of good quality.

We will illustrate the advantages of multi-valued diagnosis with a simple model of a valve. We denote the health state of the valve by the health variable h, the control variable c denotes the commanded position of the valve, and for the input and output we use i and o, respectively.

The domains of the four variables h, c, i, and o are Dh = {1 (nominal), 2 (stuck open), 3 (stuck closed), 4 (unknown)}, Di = Do = {1 (low pressure), 2 (high pressure)}, and Dc = {1 (open), 2 (close)}, respectively. Additionally, we define the a-priori probability estimator of h as p(h = 1) = 0.9, p(h = 2) = p(h = 3) = 0.045, and p(h = 4) = 0.01. The model is encoded as the multi-valued propositional formula:

$$
M =
\begin{cases}
(h = 1) \wedge (c = 1) \Rightarrow (o = i) \\
(h = 1) \wedge (c = 2) \Rightarrow (o = 1) \\
(h = 2) \Rightarrow (o = i) \\
(h = 3) \Rightarrow (o = 1)
\end{cases}
\tag{2}
$$

Next we convert M to a clausal set and the result is the following:

$$
C =
\begin{cases}
(c = 2) \vee (i = 2) \vee (o = 1) \vee (h \neq 1) \\
(c = 2) \vee (i = 1) \vee (o = 2) \vee (h \neq 1) \\
(c = 1) \vee (o = 1) \vee (h \neq 1) \\
(i = 2) \vee (o = 1) \vee (h \neq 2) \\
(i = 1) \vee (o = 2) \vee (h \neq 2) \\
(o = 1) \vee (h \neq 3)
\end{cases}
\tag{3}
$$
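The conversion from (2) to (3) can be verified exhaustively, since the four variables admit only 32 joint assignments. The sketch below assumes that the health literals in (3) are negative literals (h ≠ n), which is what the implications in (2) require; the function names are ours.

```python
from itertools import product

DH, DC, DI, DO = (1, 2, 3, 4), (1, 2), (1, 2), (1, 2)

def implies(a, b):
    return (not a) or b

def model_m(h, c, i, o):
    # The four implications of the valve model M in (2).
    return (implies(h == 1 and c == 1, o == i)
            and implies(h == 1 and c == 2, o == 1)
            and implies(h == 2, o == i)
            and implies(h == 3, o == 1))

def clauses_c(h, c, i, o):
    # The six clauses of C in (3), with negative health literals.
    return ((c == 2 or i == 2 or o == 1 or h != 1)
            and (c == 2 or i == 1 or o == 2 or h != 1)
            and (c == 1 or o == 1 or h != 1)
            and (i == 2 or o == 1 or h != 2)
            and (i == 1 or o == 2 or h != 2)
            and (o == 1 or h != 3))

assignments = list(product(DH, DC, DI, DO))
mismatches = [a for a in assignments if model_m(*a) != clauses_c(*a)]
models = [a for a in assignments if model_m(*a)]
print(len(assignments), len(models), mismatches)
```

An empty list of mismatches means that M and C have exactly the same models under this reading.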

Let us assume an observation OBS = (c = 1) ∧ (i = 2) ∧ (o = 1), that is the valve is stuck-open. The first step of Algorithm 2 would be to check the health assignment x1 = (h = 1), which has the highest probability estimator P(x1) = 0.9. Algorithm 1 will determine that SD ∪ OBS ∪ x1 ⊨ ⊥, hence we have to try the second most probable health assignment x2 = (h = 2), and in this case we have SD ∪ OBS ∪ x2 ⊭ ⊥. Hence x2 is a diagnosis and, due to the admissibility of the heuristics involved, we can pronounce x2 to be the most likely diagnosis.

Next, we consider the sparse Boolean encoding of C in (3) (normally, we would encode M as Ms and convert Ms to CNF, which is even worse than encoding C directly to Cs). To preserve the order in which we generate diagnoses, we assign a probability estimator p′(hn) to each Boolean health variable hn encoding a state h = n. We define p′(hn) = p(h = n) and p′(¬hn) = 1 − p(h = n). It is easy to show that such an assignment of the probability estimators preserves the order in which diagnoses are generated.

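$$
C_s =
\begin{cases}
(c_2 \vee i_2 \vee o_1 \vee \neg h_1) \wedge (c_2 \vee i_1 \vee o_2 \vee \neg h_1) \\
(c_1 \vee o_1 \vee \neg h_1) \wedge (i_2 \vee o_1 \vee \neg h_2) \\
(i_1 \vee o_2 \vee \neg h_2) \wedge (o_1 \vee \neg h_3) \\
(c_1 \vee c_2) \wedge (\neg c_1 \vee \neg c_2) \wedge (i_1 \vee i_2) \\
(\neg i_1 \vee \neg i_2) \wedge (o_1 \vee o_2) \wedge (\neg o_1 \vee \neg o_2) \\
(h_1 \vee h_2) \wedge (h_1 \vee h_3) \wedge (h_2 \vee h_3) \\
(\neg h_1 \vee \neg h_2) \wedge (\neg h_1 \vee \neg h_3) \wedge (\neg h_2 \vee \neg h_3)
\end{cases}
\tag{4}
$$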

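A dense encoding of C results in a shorter representation, having the same number of clauses as in the multi-valued CNF:

$$
C_d =
\begin{cases}
(h_1 \vee h_2 \vee c \vee i \vee \neg o) \\
(h_1 \vee h_2 \vee c \vee o \vee \neg i) \\
(h_2 \vee h_1 \vee \neg c \vee \neg o) \\
(h_2 \vee i \vee \neg h_1 \vee \neg o) \\
(h_2 \vee o \vee \neg h_1 \vee \neg i) \\
(h_1 \vee \neg h_2 \vee \neg o)
\end{cases}
\tag{5}
$$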

While in this example Cd is as compact as the multi-valued CNF C, we have twice as many health variables, and we would need 8 instead of 4 consistency checks if we were to add an extra health state for h in (2). The dense encodings expose another significant problem with representing the probability estimator, as the health variables h1 and h2 are not independent. Another disadvantage of the dense encoding is apparent in the cases when the number of states of a multi-valued variable is not a power of 2 (i.e., k ≠ 2^n for any n ∈ N, where k is the number of states of a variable v). In this case, with n Boolean variables encoding v, either additional constraints should be added or 2^n − k states would use two Boolean encodings per state (such health encodings, however, would result in 2^n − k cases in which the same diagnosis is computed twice, necessitating the storing of the already generated diagnoses).

A Boolean A* diagnosis³ on the sparse-encoded model has to perform 23 consistency checks in (4) for finding all diagnoses, while Algorithm 2 computes all of them with 3 checks only (in addition, multi-valued consistency checking is more efficient, as we have seen in Section 2). With dense encodings we need 4 consistency checks, which is still less efficient than the multi-valued case.

The performance difference is better illustrated when we are interested in the first (most likely) diagnosis only. In this case the multi-valued algorithm needs 2 consistency checks over a clausal set comprising 6 clauses, the sparse encodings allow computation of a leading diagnosis with 5 checks over 18 clauses, and the dense encodings do not facilitate an implementation of a heuristic function preserving the probability order. The number of clauses in Cs is 18 due to the extra inequality constraints, which additionally delay the Boolean reasoning. As we are going to show in the next section, these differences translate to significant savings (in favor of the multi-valued approach) with bigger models.

4 Experimental Performance Evaluation

To demonstrate the improved performance of the multi-valued solver we have compared it experimentally with the performance of a Boolean solver for sparse and dense encoded models. In these experiments we use the diagnosis time t as the performance metric. This is the processing time required to generate N diagnoses for a given OBS. It is measured by the diagnosis engines as CPU time in milliseconds. All the experiments described in this paper are performed on a 3 GHz Pentium IV CPU.

The algorithms discussed in this section were implemented as a part of the LYDIA model-based diagnosis toolkit⁴. The benchmark models and test vectors are available upon request. The multi-valued Polycell models discussed below were synthetically generated. Our experimentation could further benefit from the existence of a scalable multi-valued benchmark for model-based diagnosis.

³ Note that (4) can be considered as a multi-valued diagnosis problem with two values per variable.

⁴ The LYDIA package for model-based fault diagnosis can be downloaded from http://www.fdir.org/lydia/.

For a multi-valued problem and its sparse and dense encodings, we denote t by tm, ts, and td, respectively. The speedups ss and sd of the multi-valued search over the Boolean sparse and dense encodings, respectively, are calculated as ss = ts/tm and sd = td/tm. Let W denote the non-observable, non-health variables, W = V \ (OBS ∪ H). We investigate t, ss and sd in relation to the domain size |Di| and model complexity.

As discussed in the valve example, |Di| and the type of encoding determine the number of clauses and consistency checks, which affect t. Hence we investigate the relation between t and |Di|. We use the well-known synthetic, integer-domain Polycell model [de Kleer and Williams, 1987], which allows for practical and meaningful variation of |Di|. Let |Di| = dH, ∀hi ∈ H and |Di| = dW, ∀wi ∈ W. With the Polycell we perform experiments for dH = 2, 3, . . . , 9 and dW = 2, 3, . . . , 42, for nominal OBS with N equal to the maximum number of consistent solutions K.

We perform similar performance experiments on real-world models to investigate whether the expected improved performance of the multi-valued approach also holds for practical applications. We use nine models of variations of ASML wafer scanner subsystems and one of NASA's X-34 propulsion system. In contrast to Polycell, these models have varying values of |Di| for H and W. Therefore we cannot investigate a one-to-one relationship of t with |Di|. Instead, we investigate the dependencies between t and the model complexity. We use S, the number of all possible value assignments of H and W, as the measure of this complexity. Let

$$
S_H = \prod_{h_i \in H} |D_i|, \qquad S_W = \prod_{w_i \in W} |D_i|, \qquad S = S_H S_W.
$$

Table 1 shows these numbers for all real-world models and variations.

Model      SH            SW                  S
ASML1A1    2^2·4^1       2^2·3^7             1.39E+05
ASML1A2    2^2·4^1       2^4·3^13            4.08E+08
ASML1A3    2^2·4^1       2^6·3^19            1.19E+12
ASML1B1    2^6·4^3       2^2·3^7             3.58E+07
ASML1B2    2^12·4^6      2^4·3^13            4.28E+14
ASML1B3    2^18·4^9      2^6·3^19            5.11E+21
ASML2A     2^10          2^15·3^1·4^10       1.06E+14
ASML2B     2^10          2^30·3^2·4^20       1.09E+25
ASML2C     2^10          2^45·3^3·4^30       1.12E+36
X-34       2^26·4^12     3^20                3.93E+24

Table 1: SH, SW, and S for 10 real-world models and variations.
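The complexity measure S = SH·SW can be recomputed from the factorisations in Table 1. The sketch below assumes that each SH and SW entry is a product of powers (for example, 2^2·4^1 for ASML1A1), i.e. the number of variables of each domain size appears as an exponent; the printed S values then agree with the products to within their printed precision.

```python
from math import prod

# model: (S_H factors, S_W factors, S as printed); a factor is (base, exponent).
TABLE_1 = {
    "ASML1A1": ([(2, 2), (4, 1)], [(2, 2), (3, 7)], 1.39e5),
    "ASML1A2": ([(2, 2), (4, 1)], [(2, 4), (3, 13)], 4.08e8),
    "ASML1A3": ([(2, 2), (4, 1)], [(2, 6), (3, 19)], 1.19e12),
    "ASML1B1": ([(2, 6), (4, 3)], [(2, 2), (3, 7)], 3.58e7),
    "ASML1B2": ([(2, 12), (4, 6)], [(2, 4), (3, 13)], 4.28e14),
    "ASML1B3": ([(2, 18), (4, 9)], [(2, 6), (3, 19)], 5.11e21),
    "ASML2A": ([(2, 10)], [(2, 15), (3, 1), (4, 10)], 1.06e14),
    "ASML2B": ([(2, 10)], [(2, 30), (3, 2), (4, 20)], 1.09e25),
    "ASML2C": ([(2, 10)], [(2, 45), (3, 3), (4, 30)], 1.12e36),
    "X-34": ([(2, 26), (4, 12)], [(3, 20)], 3.93e24),
}

def size(factors):
    # Number of joint assignments: product of |D_i| over the variables.
    return prod(base ** exponent for base, exponent in factors)

computed = {m: size(sh) * size(sw) for m, (sh, sw, _) in TABLE_1.items()}
rel_error = {m: abs(computed[m] - TABLE_1[m][2]) / computed[m] for m in TABLE_1}

for model, s in computed.items():
    print(f"{model:8s} S = {s:.3e}")
```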

We consider three scenarios, with OBS caused by nominal behavior, single faults, and double faults, each for N = 1 and for N = min(|H|, K). Because we do not currently have a probability heuristic in place for the dense encoding, its diagnoses are unsorted. Hence comparison with the sorted multi-valued diagnoses for N < K is not always valid.


For the Polycell model, Figure 1 shows a logarithmic plot of t against dW with dH = 2, for all encodings. For dW → 42, ss ≈ 6 and sd ≈ 1, the latter indicating similar performance. The downward spikes in the dense encoding plots are due to the efficient dense encoding for log2 dW ∈ N. Experiments performed with non-nominal or fewer observations show similar results.

[Figure 1 plot: diagnosis time in ms (logarithmic axis) against dW; curves ts, td, tm.]

Figure 1: Time for computing all diagnoses for Polycell models with dH = 2 and a nominal observation vs. variable dW.

Similarly, Figure 2 shows the results for dH = 3. For dW → 42, ss ≈ 26 and sd ≈ 3. Thus, except for the spikes at dW = 16 or dW = 32, the multi-valued approach now also clearly outperforms the dense encoding. This is due to the inefficiency of dense encodings for log2 dH ∉ N.

[Figure 2 plot: diagnosis time in ms (logarithmic axis) against dW; curves ts, td, tm.]

Figure 2: Time for computing all diagnoses for Polycell models with dH = 3 and a nominal observation vs. variable dW.

Figure 3 shows a double logarithmic plot of t against dH. We omit sparse encodings because ts for dH > 3 becomes prohibitively large for practical experiments. The increase of td is exponential in ⌈log2 dH⌉, hence the staircase shape of this plot. Let γd(dH) be the increase according to

$$
\gamma_d(d_H) = \frac{t_d(d_H \mid \lceil \log_2 d_H \rceil = k)}{t_d(d_H \mid \lceil \log_2 d_H \rceil = k - 1)}, \qquad k = 2, 3, 4
\tag{6}
$$

For k = 2, 3, 4, γd(dH) ≈ 17, 21, 31. The straight line for multi-valued encodings agrees with tm ∝ dH^5. The plot shows that for dH < 9 and log2 dH ↑ ⌈log2 dH⌉, sd < 1, i.e., better performance for the dense encoding.

As tm and td for dH > 9 become prohibitively large for experiments, we need to extrapolate their relations with dH to compute sd:

$$
s_d = \frac{\gamma_d(d_H)^{\lceil \log_2 d_H \rceil}}{d_H^{5}}
\tag{7}
$$

If, e.g., we take a conservative approach and assume γd(dH) = 31, then 0.8 ≤ sd ≤ 23.9 for dH ≤ 256. If, as expected from the experimental results, γd(dH) continues to increase, then for γd(dH) > 2^5 we have sd > 1.0, and the speedup of multi-valued over dense encoding will not have an upper bound. Thus multi-valued encoding also has unbounded speedup over dense encodings.
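The quoted range for sd can be reproduced numerically. The sketch below reads (7) as sd = γd^⌈log2 dH⌉ / dH^5, which is our assumption about the flattened formula, and evaluates it for γd = 31 and 2 ≤ dH ≤ 256; the helper names are ours.

```python
def ceil_log2(d):
    # Exact integer ceiling of log2(d) for d >= 1.
    return (d - 1).bit_length()

def s_d(d_h, gamma=31.0, exponent=5):
    # Extrapolated speedup of the multi-valued search over the dense encoding.
    return gamma ** ceil_log2(d_h) / d_h ** exponent

values = {d: s_d(d) for d in range(2, 257)}
d_min = min(values, key=values.get)
d_max = max(values, key=values.get)

print(d_min, round(values[d_min], 1))
print(d_max, round(values[d_max], 1))
```

Under this reading the minimum is attained at dH = 256, where the dense encoding uses every code word, and the maximum just above a power of two, at dH = 129. At powers of two, sd = (γd / 2^5)^k, which explains the threshold γd > 2^5 for sd > 1.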

[Figure 3 plot: diagnosis time in ms against dH, both axes logarithmic; curves td, tm.]

Figure 3: Double logarithmic plot of time for computing all diagnoses for multi-valued and dense encoded Polycell models with dW = 42 and a nominal observation vs. variable dH.

For the real-world models, Figure 4 shows the double logarithmic plot of t vs. S for N = 1 for the multi-valued solver for nominal, single, and double faults. As expected, it shows an increasing trend for t vs. S, and for increased cardinality of the faults. Figure 5 shows ss and sd. Note that data points above the thick line s = 1 indicate an actual speedup of the multi-valued approach. In the nominal case, ss < 1, which is caused by a larger overhead of the multi-valued approach for N = 1. The effect of this overhead disappears for single and double faults, where 10^1 < ss < 10^4. Because of the omitted probability heuristic mentioned earlier, the analysis of sd is less straightforward. We note that in roughly half the cases the speedup is similar to that over the sparse encoding.

Figures 6 and 7 show t and s for N = min(|H|, K). As N ≥ 1, the initial overhead of the multi-valued approach is amortized over the multiple diagnoses, resulting in 10^1 < ss < 10^4 for nominal, single, and double faults. For sd, it is interesting to see that the effect of the omitted probability heuristic is most apparent for the double faults. This can be explained by the fact that in these cases the dense solver generates many-fault solutions with low probability faster than the multi-valued solver generates true double-fault solutions. The latter have higher t because of more constrained consistency checking.


[Figure 4 plot: tm in ms against S, both axes logarithmic; series for nominal health, single fault, double fault.]

Figure 4: Double logarithmic plot of diagnosis time with N = 1 and observations consistent with different fault cardinalities vs. variable model complexity.

[Figure 5 plot: speedup against S, both axes logarithmic; series ss and sd for F = 0, F = 1, F = 2.]

Figure 5: Double logarithmic plot of speedup with N = 1 and observations consistent with nominal health (F = 0), single fault (F = 1), and double fault (F = 2) vs. variable model complexity.

In summary, the experiments clearly demonstrate the speedup of the multi-valued approach over the sparse encoding, both in relation to the domain size and the number of states. Speedups of 10^2 are readily achieved. The same conclusion holds for dense encodings as far as the domain size experiments are concerned. Especially for larger domain sizes the speedup is considerable. For smaller domain sizes, closer proximity to 2^i, i ∈ N, means better performance for the dense encodings. Because of the lacking probability heuristic, the relation of sd with S remains somewhat unclear, despite the fact that sd > 5 in many cases.

[Figure 6 plot: tm in ms against S, both axes logarithmic; series for nominal health, single fault, double fault.]

Figure 6: Double logarithmic plot of diagnosis time with N = min(|H|, K) and observations consistent with different fault cardinalities vs. variable model complexity.

[Figure 7 plot: speedup against S, both axes logarithmic; series ss and sd for F = 0, F = 1, F = 2.]

Figure 7: Double logarithmic plot of speedup with N = min(|H|, K) and observations consistent with nominal health (F = 0), single fault (F = 1), and double fault (F = 2) vs. variable model complexity.

5 Conclusion

This paper introduces a new algorithm for computing diagnoses which works directly on the multi-valued representation of a model. The sound and complete algorithm comprises a DPLL-based multi-valued consistency check and a multi-valued A* search. The two routines eliminate the need for model encodings, thus preventing the loss of locality information which often leads to performance degradation.

In contrast to dense Boolean encoding, the multi-valued A* algorithm allows an intuitive assignment and interpretation of a-priori probabilities, which, combined with the greedy A* search described in this paper, allows for precise control over the termination criteria for the diagnostic computation. This allows the generation of only those leading diagnoses that contain significant probability mass (thus making the search incomplete). While sparse Boolean encoding is more suitable than dense encoding for heuristic search based on a-priori probability, the combinatorial explosion caused by the introduction of new variables makes it suitable for the tiniest diagnosis problems only.

We have empirically compared the performance of the algorithm to sparse and dense Boolean encodings. These experimental results confirm our analysis and show that the multi-valued search outperforms both types of encoding, in particular being two orders of magnitude faster than sparse Boolean encoding. While dense encoding is faster than sparse encoding (but still slower in comparison to the multi-valued approach), it is less amenable to heuristics based on a-priori health probability estimation, widely used in model-based diagnosis.

In future work, we aim to address this estimator problem, allowing the three methods to be compared not only in the cases of generating all diagnoses but also when computing the first N leading diagnoses. In this case we would be able to analyze how the probability assignment influences the diagnostic search. Last, we note the need for representative multi-valued benchmarks, which would enhance the experimental results of this paper.

Furthermore, we envision our algorithm used in combination with other reasoning methods which improve performance by using locality. Analysis [Ramesh et al., 1997] and experience have shown that the higher-level the reasoning engine we use (by higher-level we mean a diagnostic engine which uses a model representation closer to the original model), the better the performance we get. In the future, we would be interested in combining, e.g., non-clausal search methods, hierarchical search [Feldman et al., 2005], and the multi-valued approach described here for more efficient model-based diagnosis and related diagnostic reasoning.

Acknowledgments

We extend our gratitude to the anonymous reviewers for their valuable feedback.

References

[Bandelj et al., 2002] A. Bandelj, I. Bratko, and D. Suc. Qualitative simulation with CLP. In Proc. QR'02, 2002.

[Darwiche, 2004] Adnan Darwiche. New advances in compiling CNF into DNNF. In Proc. ECAI'04, pages 328–332, 2004.

[de Kleer and Williams, 1987] J. de Kleer and B. Williams. Diagnosing multiple faults. JAI, 32(1):97–130, 1987.

[de Kleer, 1989] J. de Kleer. A comparison of ATMS and CSP techniques. In Proc. IJCAI'89, pages 290–296, 1989.

[Fattah and Dechter, 1995] Yousri El Fattah and Rina Dechter. Diagnosing tree-decomposable circuits. In Proc. IJCAI'95, pages 1742–1749, 1995.

[Feldman et al., 2005] A. Feldman, A.J.C. van Gemund, and A. Bos. A hybrid approach to hierarchical fault diagnosis. In Proc. DX'05, pages 101–106, 2005.

[Frisch and Peugniez, 2001] Alan Frisch and T. Peugniez. Solving non-boolean satisfiability problems with stochastic local search. In Proc. IJCAI'01, pages 282–290, 2001.

[Frohlich and Nejdl, 1997] Peter Frohlich and Wolfgang Nejdl. A static model-based engine for model-based reasoning. In Proc. IJCAI'97, pages 466–473, 1997.

[Hoos, 1999] H. Hoos. SAT-encodings, search space structure, and local search performance. In Proc. IJCAI'99, pages 296–303, 1999.

[Liu et al., 2003] C. Liu, A. Kuehlmann, and M. Moskewicz. CAMA: A multi-valued satisfiability solver. In Proc. ICCAD'03, pages 326–333, 2003.

[Nayak and Williams, 1998] Pandurang Nayak and Brian Williams. Fast context switching in real-time propositional reasoning. In Proc. AAAI'98, pages 50–56, 1998.

[Parthasarathy et al., 2004] G. Parthasarathy, M. Iyer, K.T. Cheng, and L. Wang. An efficient finite-domain constraint solver for circuits. In Proc. DAC'04, pages 212–217, 2004.

[Provan, 2001] G. Provan. Hierarchical model-based diagnosis. In Proc. DX'01, pages 167–174, 2001.

[Ramesh et al., 1997] A. Ramesh, G. Becker, and N. Murray. CNF and DNF considered harmful for computing prime implicants/implicates. JAR, 18(3):337–356, 1997.

[Sachenbacher and Williams, 2004] Martin Sachenbacher and Brian Williams. Diagnosis as semiring-based constraint optimization. In Proc. ECAI'04, pages 873–877, 2004.

[Sgarlata and Winters, 1997] P. Sgarlata and B. Winters. X-34 propulsion system design. In Proc. AIAA Joint Propulsion Conference and Exhibit, 1997.

[Srinivasan et al., 1990] Arvind Srinivasan, Timothy Kam, Sharad Malik, and Robert Brayton. Algorithms for discrete function manipulation. In Proc. ICCAD'90, pages 92–95, 1990.

[Stumptner and Wotawa, 2001] Markus Stumptner and Franz Wotawa. Diagnosing tree-structured systems. JAI, 127(1):1–29, 2001.

[Stumptner and Wotawa, 2003] Markus Stumptner and Franz Wotawa. Coupling CSP decomposition methods and diagnosis algorithms for tree structured systems. In Proc. IJCAI'03, pages 388–393, 2003.

[Vatan, 2002] F. Vatan. The complexity of the diagnosis problem. Technical Report NPO-30315, Jet Propulsion Laboratory, California Institute of Technology, 2002.

[Williams and Ragno, 2004] Brian Williams and R. Ragno. Conflict-directed A* and its role in model-based embedded systems. JDAM, 2004.


A general method for diagnosing axioms

Gerhard Friedrich, Stefan Rass, and Kostyantyn Shchekotykhin
Universitaet Klagenfurt, Universitaetsstrasse 65, 9020 Klagenfurt, Austria, Europe

[email protected]

Abstract

Full support for debugging knowledge bases must not stop at the level of axioms. In this work, we present a general theory for diagnosing faulty knowledge bases, which not only allows the identification of faulty axioms, but also the pinpointing of those parts of axioms which must be changed. Based on our theory, we present methods for computing these diagnoses and show their feasibility by extensive test evaluations. The proposed approach is broadly applicable, since it is independent of a particular logical language (with monotonic semantics) and independent of a particular reasoning system.

1 Introduction

A broad adoption of knowledge-based systems requires effective methods to support the test and debug cycle of knowledge bases. In this cycle the knowledge engineer has to diagnose the knowledge base (KB) in order to identify those parts which must be changed such that the intended behavior is achieved. This task becomes challenging even in moderately sized knowledge bases with hundreds of axioms. Therefore, considerable research effort [Schlobach and Cornet, 2003; Parsion et al., 2005; Friedrich and Shchekotykhin, 2005; Haase et al., 2005; Wang et al., 2005] was put into the improvement of debugging. All these approaches have in common that they either are not based on a well-founded diagnosis theory or consider an axiom as the smallest entity which could be faulty. Consequently, no general diagnosis approach exists that identifies those parts of axioms which must be changed such that all tests and requirements for a KB are fulfilled. Diagnosing axioms becomes especially important in knowledge bases where axioms are of remarkable size, e.g. the Galen ontology¹ comprises axioms with more than 100 arguments in a logical operator. In this case, debugging is still difficult even if diagnosis provides a set of axioms that need further investigation by the knowledge engineer.

¹ A test version is included in a benchmark suite for description logic, e.g. see RACER's version at http://www.racer-systems.org

Consequently, an important research question is whether it is possible to develop a general theory and practically applicable algorithms for the diagnosis of knowledge bases which improve the resolution of diagnoses by identifying faulty parts of axioms. Such an algorithm could be very helpful as an interactive debugging tool for knowledge engineers who want to inspect diagnoses and axioms at a more fine-grained resolution level.

We address this question by the development of a general theory for the diagnosis of knowledge bases which are expressed in a declarative knowledge representation language. This theory not only allows the identification of faulty axioms, but also of the faulty parts of axioms. Furthermore, this theory is applicable to all declarative knowledge representation languages which are based on a variant of first-order logic (FOL), e.g. description logic (DL) or the OWL language family, which plays an important role in the implementation of the Semantic Web. We will show that this theory is a generalization of the diagnosis theory of [Friedrich and Shchekotykhin, 2005], which on the one hand provides a general theory for diagnosing knowledge bases but on the other hand considers axioms as the finest granularity of diagnoses. For the implementation, we employ a transformation of the axioms into a set of axioms such that the original diagnosis algorithms presented in [Friedrich and Shchekotykhin, 2005] can be applied. As a consequence, this approach provides a sound and complete method for the generation of diagnoses for axioms independently of a particular reasoning system. Therefore, we are not limited to special properties of the knowledge bases (e.g. acyclic) or restrictions of the representation language. We present enhancements of this algorithm which lead to considerable improvements of the running time for the diagnosis of axioms. Finally, we show the feasibility of our methods by exploiting a standard test library.

In the following section, we present our basic idea for diagnosing axioms. In Section 3, we introduce a general theory for diagnosing axioms and relate this theory to the existing theory of diagnosing knowledge bases. The foundation for computing axiom diagnoses is given in Section 4, followed by an evaluation in Section 5. The paper closes with a discussion of related work and final conclusions.


2 Limitation of current approach

To introduce our concepts, we consider the DL knowledge base² bike2.tkb, available through the benchmark suite for the RACER system. This KB comprises 154 axioms. Let us assume that in the knowledge acquisition process one of the axioms was incorrectly stated. In our example, the axiom defining concept C13 is incorrectly specified, i.e. in the depicted Axiom 66 the correct expression (SOME R11 C75) has been replaced by the faulty expression (ALL R11 C75):

    [66:] (DEFINE-CONCEPT C13
            (AND (SOME R22 C63)
                 (SOME R11 C74) (ALL R11 C75)
                 (AT-MOST 3 R11) (AT-LEAST 2 R11)
                 (SOME R14 *TOP*) (SOME R30 *TOP*)
                 (AT-LEAST 2 R19) (SOME R4 *TOP*)
                 (SOME R23 *TOP*) (SOME R2 *TOP*)))

As a consequence, the KB bike2.tkb becomes incoherent because of Axiom 66 and the axioms

    [145:] (IMPLIES C74 (NOT C75))
    [146:] (IMPLIES C75 (NOT C74))

A knowledge base is incoherent iff there exists a concept or role which is incoherent. A concept or role is incoherent iff it has an empty extension in all models. In our example C13 is incoherent in bike2.tkb. The sets of axioms 〈66, 145〉 and 〈66, 146〉 are the minimal incoherent subsets of the example KB. Such sets are called minimal conflicts in the terminology of model-based diagnosis [Friedrich and Shchekotykhin, 2005]. As expected, the diagnosis engine of [Friedrich and Shchekotykhin, 2005] correctly returns Axiom 66 as the only single-fault diagnosis. More formally, a KB-Diagnosis problem is defined as follows.
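The step from the two minimal conflicts to the diagnosis can be illustrated with a brute-force computation. The sketch assumes the standard characterisation from model-based diagnosis, namely that minimal diagnoses are the minimal hitting sets of the minimal conflicts, and takes the two conflicts stated above as given; it is not the diagnosis engine of the cited work.

```python
from itertools import combinations

def minimal_hitting_sets(conflicts):
    """Enumerate minimal hitting sets by increasing cardinality."""
    universe = sorted(set().union(*conflicts))
    hitting = []
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = set(candidate)
            hits_all = all(chosen & conflict for conflict in conflicts)
            # Skip supersets of hitting sets that were already found.
            is_minimal = not any(found <= chosen for found in hitting)
            if hits_all and is_minimal:
                hitting.append(chosen)
    return hitting

# Minimal conflicts of the example KB, given by axiom number.
conflicts = [{66, 145}, {66, 146}]
diagnoses = minimal_hitting_sets(conflicts)
print(diagnoses)
```

Axiom 66 is the only hitting set of cardinality one; the only other minimal hitting set removes both Axiom 145 and Axiom 146.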

Definition 1 (KB-Diagnosis Problem) A KB-Diagnosis Problem (Diagnosis Problem for a Knowledge Base, [Friedrich and Shchekotykhin, 2005]) is a tuple (KB, B, TC+, TC−), where KB is a knowledge base, B is a background theory, TC+ is a set of positive and TC− a set of negative test cases, which the KB has to be consistent or inconsistent with, respectively. The test cases are given as sets of logical sentences. We assume that each test case and the background theory on its own are consistent.

The principal idea of the following definition of a diagnosis for a KB is to find a set of axioms that must be changed (respectively deleted) and, possibly, some axioms that must be added such that all test cases are satisfied. The symbol ⊥ expresses a contradiction.

Definition 2 (KB-Diagnosis) [Friedrich and Shchekotykhin, 2005] A KB-Diagnosis for a KB-Diagnosis Problem (KB, B, TC+, TC−) is a set D ⊆ KB of sentences such that there exists an extension EX, where EX is a set of logical sentences added to the knowledge base, such that

1. ∀e+ ∈ TC+ : (KB − D) ∪ B ∪ EX ∪ e+ ⊭ ⊥

2. ∀e− ∈ TC− : (KB − D) ∪ B ∪ EX ∪ e− ⊨ ⊥

A minimal diagnosis is a diagnosis such that no proper subset is a diagnosis. A minimum cardinality diagnosis is a diagnosis such that there exists no diagnosis with smaller cardinality.

² We assume the reader to be familiar with the basics of description logic. Otherwise, [Baader et al., 2003] provides an excellent introduction. In addition, we would like to stress that using DL knowledge bases as examples does not imply any limitation of our approach to DL. We used DL because of its importance for the Semantic Web.

According to [Friedrich and Shchekotykhin, 2005, Corollary 1], we may characterize EX by the conjunction of all negated negative test cases. In particular, D is a diagnosis for (KB, B, TC+, TC−) iff

$$
\forall e^{+} \in TC^{+} : \; (KB - D) \cup B \cup e^{+} \cup \bigwedge_{e^{-} \in TC^{-}} (\neg e^{-}) \text{ is consistent.}
$$

In order to keep the example simple, we do not specify a background theory or test cases; but we require coherence and consistency.

Presenting Axiom 66 as the only minimal single fault diagnosis does not provide information about the parts of Axiom 66 which caused the incoherence. In particular, we would like to identify those parts of a faulty axiom which must be changed such that the requirements (e.g. coherence and compliance with test cases) are fulfilled.

In order to achieve this task, we base our principal idea on the observation that axioms are composed of structures according to a predefined grammar. E.g. Axiom 66 consists of an AND-structure with 11 arguments. Each argument is a structure itself. In general, such structures are defined by a grammar, where the terminal symbols are literals. Exploiting this observation, we recognize various possibilities for restoring coherence. Either the AND-structure must be changed (e.g. by deleting the second argument of the AND-structure) or one of its arguments. Analyzing these arguments, we recognize that only the arguments (SOME R11 C74) and (ALL R11 C75) are relevant for producing an incoherent KB. Changes to arguments or operators of these structures will resolve the incoherence, e.g. by replacing the ALL operator by a SOME operator or by changing the names of concepts or roles. As a consequence, we can pinpoint the parts of the axiom which must be changed, thus exonerating the greater part of Axiom 66. In the following, we will generalize the ideas presented in the example.

3 Diagnosis of axioms

The refinement of a KB-Diagnosis is based on the structure of the axioms according to the underlying syntax of the knowledge representation language. For our methods, we assume that the syntax of the language is defined by a context-free grammar G. In particular, we assume the usual structuring of logic-based languages expressed by production rules according to the following prototypical rule:

$V_0 \rightarrow op(V_1, \ldots, V_n, N_j^*)$,

where $V_0$ is a non-terminal, and the right-hand side is a logical structure defined by a logical operator op and arguments $V_1, \ldots, V_n$, which are logical structures. $N_j^*$ denotes a possibly empty list of non-logical arguments, i.e. numbers. Logical structures correspond to literals or otherwise can be recursively composed by applying operators to simpler structures. Consequently, a logical structure is either an operator application or a literal. Context-free grammars may also have a single non-terminal symbol on the right-hand side ($V_0 \rightarrow V_m$). In this case the syntax tree [Linz, 1996] comprises intermediate nodes, which can be replaced by the successor of this node. Therefore, we assume the following structuring of axioms, where L is a logical structure and LI is a literal:

$L \rightarrow op(L, \ldots, L, N_j^*) \mid LI$

Note that op depends on the language. In the case of FOL, op corresponds to the usual logical connectives and to quantifications of variables (e.g. ∃x). However, in DL, op corresponds to one of the logical operators defined in specific DL variants (e.g. ALL, SOME, AND, OR). Furthermore, a LISP-like notation may be chosen. A simple grammar used for DL within the RACER system can be found in [Patel-Schneider and Swartout, 1993], e.g.

C0 → CN | (AND C1 . . . Cn) | (ALL R C0) | . . .

where CN is an atomic concept (i.e. a literal), Ci are concepts and R is a role.
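To make the structuring L → op(L, ..., L, Nj*) | LI concrete, the following Python sketch reads a LISP-like expression into a nested tuple whose first element is the operator. The function names `tokenize` and `parse` and the tuple representation are assumptions chosen for illustration; they are not taken from RACER or from the KRSS specification, and malformed input is not handled.

```python
def tokenize(text):
    """Split a LISP-like expression into parentheses and symbols."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Parse one logical structure: a literal or (op arg1 ... argn)."""
    token = tokens.pop(0)
    if token != "(":
        # Non-logical arguments (numbers) become ints, literals stay strings.
        return int(token) if token.isdigit() else token
    node = [tokens.pop(0)]            # the operator
    while tokens[0] != ")":
        node.append(parse(tokens))    # logical or non-logical argument
    tokens.pop(0)                     # consume the closing parenthesis
    return tuple(node)

tree = parse(tokenize("(AND (SOME R11 C74) (ALL R11 C75) (AT-MOST 3 R11))"))
print(tree)
```

Each tuple corresponds to an operator node of the syntax tree and each string to a literal, so the arcs of the tree are exactly the operator-argument relationships.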

In addition to the logical arguments $V_i$ of an operator, a language may exploit non-logical arguments $N_j$, which modify the meaning of this operator; e.g. a DL could comprise the (AT-LEAST N R) operator, where N is a natural number and R is a role. Viewing it as a first order logic statement, a logical structure is introduced which depends on N. Regarding the meaning, we make the usual assumption that the semantics of logical structures is given denotationally using an interpretation $I = \langle \Delta^I, (\cdot)^I \rangle$, where $\Delta^I$ is a domain (non-empty universe) of values and $(\cdot)^I$ is an interpretation function. Literals are subsets of $\Delta^I$ or relations over $\Delta^I$, and the semantics is inductively defined. Since we are interested in a general theory for diagnosing logical knowledge bases, we constrain the semantics as little as possible. In order to deal with variable symbols, we have to define a partial function μ, which provides a substitution of some variables by elements of $\Delta^I$. We assume that the semantics of op is defined by $op(V_1, \ldots, V_n, N_j^*)^{I,\mu} := op(V_1^{I,\mu}, \ldots, V_n^{I,\mu}, N_j^*)$, where op is defined as a partial function that maps $V_1^{I,\mu}, \ldots, V_n^{I,\mu}, N_j^*$ to a value, depending on the logic defined, e.g. to truth values in case op is a logical connective in FOL or to subsets of $\Delta^I$ in case op is an operator of a DL. An axiom is satisfied by I, μ if it is true. Note that non-logical structures are not interpreted by I. E.g. the semantics of $(\text{AT-LEAST } N\ R)^I$ is defined as $\{a \in \Delta^I \mid |\{b \mid (a, b) \in R^I\}| \geq N\}$, where N is a non-logical argument. Finally, we assume a monotonic semantics of the logic.
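The role of a non-logical argument can be illustrated on a small finite interpretation. The sketch below evaluates the extension of (AT-LEAST N R) directly from its set definition; the domain, the role extension `R` and the function name `at_least` are made-up illustrations and do not come from the bike2 knowledge base.

```python
def at_least(n, role, domain):
    """Extension of (AT-LEAST n R): all a with at least n distinct R-successors.

    `n` is the non-logical argument; it is used as a plain number and is
    not looked up in the interpretation.
    """
    return {a for a in domain
            if len({b for (x, b) in role if x == a}) >= n}

domain = {"a", "b", "c", "d"}                 # the universe of the toy interpretation
R = {("a", "b"), ("a", "c"), ("b", "c")}      # interpretation of the role R

print(sorted(at_least(2, R, domain)))  # ['a']
print(sorted(at_least(1, R, domain)))  # ['a', 'b']
```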

As a consequence, every axiom can be represented by its syntax tree (cf. Figure 1). Every operator, literal, and non-logical argument of an axiom is represented by a node. There is a directed arc n1 → n2 from node n1 to node n2 iff n2 is an argument of n1. We then say that n1 is the predecessor of n2. More generally, we can view the whole KB as one tree (called the KB-tree), where the first level (root node) is the start symbol of the grammar, and the second level corresponds to the operators (logical structures) defining axioms (e.g. DEFINE-CONCEPT). A complete subtree rooted at a node in the KB-tree defines a logical expression. An axiom is a logical expression itself.

Figure 1: Syntax tree for our example axiom. Arcs represent relationships between operators and their corresponding arguments.

Based on this representation, we can generalize the concept of KB-Diagnosis in order to identify faulty logical structures. For this generalization, let every logical structure L (which is either represented by an operator or by a literal) of a KB and its associated logical expression E be uniquely identified by a marker M(L) resp. M(E). The following example shows a possible marking of Axiom 66:

(DEFINE-CONCEPT_66.0 C13_66.1
    (AND_66.2 (SOME_66.3 R22_66.4 C63_66.5)
              (SOME_66.6 R11_66.7 C74_66.8) . . . ))
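One way to obtain such a marking is a pre-order numbering of the operators and literals of the syntax tree that skips non-logical arguments. The Python sketch below is an illustration under that assumption (the paper does not prescribe a numbering scheme); `mark` and the nested-tuple encoding of Axiom 66 are hypothetical names introduced here.

```python
AXIOM_66 = ("DEFINE-CONCEPT", "C13",
            ("AND", ("SOME", "R22", "C63"), ("SOME", "R11", "C74"),
             ("ALL", "R11", "C75"), ("AT-MOST", 3, "R11"),
             ("AT-LEAST", 2, "R11"), ("SOME", "R14", "*TOP*"),
             ("SOME", "R30", "*TOP*"), ("AT-LEAST", 2, "R19"),
             ("SOME", "R4", "*TOP*"), ("SOME", "R23", "*TOP*"),
             ("SOME", "R2", "*TOP*")))

def mark(tree, axiom_id, counter=None, out=None):
    """Return (marker, symbol) pairs in pre-order; numbers get no marker."""
    if counter is None:
        counter, out = [0], []
    if isinstance(tree, int):                 # non-logical argument
        return out
    symbol = tree[0] if isinstance(tree, tuple) else tree
    out.append((f"{axiom_id}.{counter[0]}", symbol))
    counter[0] += 1
    if isinstance(tree, tuple):
        for argument in tree[1:]:
            mark(argument, axiom_id, counter, out)
    return out

marks = mark(AXIOM_66, 66)
print(marks[:9])
```

With this numbering the first nine markers agree with the marking shown above (66.0 for DEFINE-CONCEPT up to 66.8 for C74).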

Furthermore, we will exploit a replacement operator KB[L/R], where the logical structure L is replaced by a logical expression R in the knowledge base KB. Replacements of logical structures are regarded as repairs for faulty descriptions. Therefore, we will only consider syntactically valid replacements of a structure. In our example, some of these syntactically valid replacements for AND_66.2 could be the change to an OR operator, the addition of a NOT before the AND, or the introduction of a new operator (e.g. an OR) combining some arguments. KB[L1/R1, . . . , Ln/Rn] denotes the simultaneous replacement of L1, . . . , Ln by R1, . . . , Rn. The principal idea of the following definition is to identify those logical structures which are the cause for not satisfying the requirements (e.g. failed test cases or an incoherent KB). A set of logical structures is a cause for not satisfying the requirements iff there exists a replacement of these structures (a possible repair) such that all requirements are satisfied. More formally:

Definition 3 (AX-Diagnosis) Let LS be the set of logical structures of the knowledge base KB. An AX-Diagnosis for a KB-Diagnosis Problem (KB, B, TC+, TC−) is a set $D = \{L_1, \ldots, L_n\} \subseteq LS$ such that there exist syntactically valid replacements $R_i$ for each logical structure $L_i \in D$ and an extension EX, where EX is a set of logical sentences added to the knowledge base, such that

1. $\forall e^+ \in TC^+ : KB[L_1/R_1, \ldots, L_n/R_n] \cup B \cup EX \cup e^+ \not\models \bot$

2. $\forall e^- \in TC^- : KB[L_1/R_1, \ldots, L_n/R_n] \cup B \cup EX \cup e^- \models \bot$

We say that such a replacement KB[L1/R1, . . . , Ln/Rn] clears (or repairs) the fault. A minimal AX-Diagnosis D is defined as usual by requiring that no proper subset D′ ⊂ D is an AX-Diagnosis. Likewise, D is a minimum cardinality AX-Diagnosis if there exists no AX-Diagnosis with smaller cardinality than D.

In our running example, there exist eight minimum cardinality (single fault) AX-diagnoses, marked by brackets:

([DEFINE-CONCEPT_66.0] C13
    ([AND_66.2] (SOME R22 C63)
        ([SOME_66.6] [R11_66.7] [C74_66.8])
        ([ALL_66.9] [R11_66.10] [C75_66.11])
        (AT-MOST 3 R11) (AT-LEAST 2 R11)
        (SOME R14 *TOP*) (SOME R30 *TOP*)
        (AT-LEAST 2 R19) (SOME R4 *TOP*)
        (SOME R23 *TOP*) (SOME R2 *TOP*)))

Note that in fact each of these structures can be replaced in order to restore the coherence of bike2, e.g. in AND_66.2 the second and third argument could be replaced by a new logical structure combining them by an OR, and the concepts and roles C74_66.8, C75_66.11, R11_66.7, and R11_66.10 could be replaced by other roles and concepts not mentioned in bike2. ALL_66.9 could be replaced by SOME, and SOME_66.6 may be preceded by a NOT. Finally, with respect to a replacement of DEFINE-CONCEPT_66.0, a complete deletion of Axiom 66 (the axiom may be out-dated) is a possible repair.

Note that replacements of logical structures in the example axiom which are not among the eight single fault diagnoses do not restore the coherence of bike2. More generally, logical structures that are not contained in minimal diagnoses need not be considered for replacement. Consequently, a large fraction of the example axiom does not require further investigation by the knowledge engineer.

Note that if we focus the diagnosis process on minimum cardinality AX-Diagnoses, then the exoneration of logical structures can be extended. Let N be the cardinality of the minimum cardinality diagnoses. Then it is clear that any replacement of a logical structure not contained in a minimum cardinality diagnosis can only resolve the fault if at least N + 1 repairs are performed in total.

As is generally the case in diagnosis, additional information such as test cases and extensions to the background theory could be exploited to reduce the number of most likely diagnoses (e.g. the number of minimum cardinality diagnoses). In addition, one could imagine a more sophisticated estimation regarding the likelihood of AX-diagnoses. However, since we focus on the foundations of diagnosing axioms, both tasks are out of the scope of this work.

In case the knowledge engineer wants to change a logical structure in addition to those contained in a minimal AX-Diagnosis, no additional changes are necessary because of the following property:

Remark 1 Every superset of a minimal AX-diagnosis for a KB-Diagnosis Problem is an AX-diagnosis.

Since a replacement of a logical structure by a logical expression may also include changes in the arguments, the concept of AX-Diagnosis shares some similarities with hierarchical diagnosis:

Remark 2 Let $D = \{L_1, \ldots, L_i, \ldots, L_n\}$ be an AX-Diagnosis for a KB-Diagnosis Problem (KB, B, TC+, TC−), and let the logical structure $L'_i$ be a predecessor of $L_i$ in the KB-tree. Then $\{L_1, \ldots, L'_i, \ldots, L_n\}$ (i.e. replacing $L_i$ by $L'_i$ in D) is an AX-Diagnosis.

Roughly speaking, if we regard logical arguments as sub components, then the previous remark says that a faulty sub component also implies a faulty super component. However, the converse is not necessarily true. This converse is usually assumed in hierarchical diagnosis, i.e. if a super component is faulty then at least one of its sub components is faulty. An AX-Diagnosis $\{L_1, \ldots, L'_i, \ldots, L_n\}$ might contain a logical structure $L'_i$ for which there does not exist an AX-Diagnosis $\{L_1, \ldots, L_i, \ldots, L_n\}$, where $L'_i$ is replaced by a successor $L_i$ w.r.t. the KB-tree; e.g. operators could be wrongly defined, leading to an inconsistency independently of the logical arguments. In the following section, we will show the basic methods for the computation of AX-Diagnoses.

4 Computation of Axiom Diagnoses

The principal idea of diagnosing axioms is to translate an axiom into a set of axioms, which allows the application of the diagnosis methods introduced in [Friedrich and Shchekotykhin, 2005]. For the translation, we assume that the logic contains the equivalence operator (which may be simulated by exploiting implication and conjunction). More formally, we require $(V_1 \equiv V_2)^{I,\mu} := (V_1^{I,\mu} = V_2^{I,\mu})$. We apply this operator in order to decompose logical expressions. Let $E_i^{y_i}$ be logical expressions with free variables $y_i$ and $X_i(y_i)$ be a unique literal with variables $y_i$ as arguments. A logical expression $op(E_1^{y_1}, \ldots, E_n^{y_n}, N_j^*)$ is replaced by $op(X_1(y_1), \ldots, X_n(y_n), N_j^*)$ and a set of additional axioms $X_1(y_1) \equiv E_1^{y_1}, \ldots, X_n(y_n) \equiv E_n^{y_n}$. A complete decomposition for our sample Axiom 66 would be the following:

[66.0:] (DEFINE-CONCEPT X66.1 X66.2)
[66.1:] (EQUIVALENT X66.1 C13)
[66.2:] (EQUIVALENT X66.2 (AND X66.3 X66.6 . . . ))
[66.3:] (EQUIVALENT X66.3 (SOME X66.4 X66.5))
[66.4:] (ROLES-EQUIVALENT X66.4 R22)
[66.5:] (EQUIVALENT X66.5 C63)
[66.6:] (EQUIVALENT X66.6 (SOME X66.7 X66.8))
. . .
[66.32:] (EQUIVALENT X66.32 *TOP*)

This transformation preserves the interpretation of the original logical expression.

Remark 3 Let the logical expression $op(E_1^{y_1}, \ldots, E_n^{y_n}, N_j^*)$ be decomposed as described above, $I = \langle \Delta^I, (\cdot)^I \rangle$ be an interpretation, and μ a variable substitution for the free variables. Then $op(X_1(y_1), \ldots, X_n(y_n), N_j^*)^{I,\mu} = op(E_1^{y_1}, \ldots, E_n^{y_n}, N_j^*)^{I,\mu}$.


FUNCTION DECOMP(E): Return a set of axioms
Input: a logical expression E
if E is a literal with free variables y then
    if E is an axiom then return [M(E):] E
    else return [M(E):] X_M(E)(y) ≡ E
else let E = op(E1, . . . , En, Nj*), where yi are the free variables of Ei,
        and y = ⋃i yi are the free variables of E
    if E is an axiom then return
        [M(E):] op(X_M(E1)(y1), . . . , X_M(En)(yn), Nj*) ∪ ⋃i=1,...,n DECOMP(Ei)
    else return
        [M(E):] X_M(E)(y) ≡ op(X_M(E1)(y1), . . . , X_M(En)(yn), Nj*) ∪ ⋃i=1,...,n DECOMP(Ei)

Figure 2: Decomposition of an axiom

The complete decomposition of an axiom and all its subexpressions can be performed by the recursive function depicted in Figure 2. We exploit the markers of the logical expressions E of a KB to mark corresponding axioms by [M(E):].
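The recursion of Figure 2 can be sketched in Python for the variable-free DL case. This is an illustration under several assumptions rather than the implemented engine: expressions are nested tuples, markers are assigned in pre-order, non-logical (numeric) arguments stay in place, and a single `EQUIVALENT` operator is used for concepts and roles alike (the decomposition shown above uses ROLES-EQUIVALENT for roles). The name `decomp` and the shortened example axiom are hypothetical.

```python
def decomp(axiom, axiom_id):
    """Return {marker: axiom} for a nested-tuple expression without variables."""
    out, counter = {}, [0]

    def visit(expr, is_axiom):
        marker = f"{axiom_id}.{counter[0]}"
        counter[0] += 1
        x = f"X{marker}"                       # fresh literal X_M(E)
        if isinstance(expr, str):              # E is a literal
            out[marker] = expr if is_axiom else ("EQUIVALENT", x, expr)
            return x
        out[marker] = None                     # reserve the slot in pre-order
        # Replace each logical argument by its fresh literal; keep numbers.
        args = [a if isinstance(a, int) else visit(a, False) for a in expr[1:]]
        body = (expr[0], *args)
        out[marker] = body if is_axiom else ("EQUIVALENT", x, body)
        return x

    visit(axiom, True)
    return out

small_axiom = ("DEFINE-CONCEPT", "C13",
               ("AND", ("SOME", "R22", "C63"), ("AT-MOST", 3, "R11")))
decomposed = decomp(small_axiom, 66)
for marker, ax in decomposed.items():
    print(f"[{marker}:]", ax)
```

For the shortened axiom the entries 66.0 to 66.3 have the same shape as the first four lines of the decomposition of Axiom 66 given above.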

Based on a decomposition of the axioms of a KB, we can show that there is a one-to-one correspondence between the KB-Diagnoses of the decomposed knowledge base and the AX-Diagnoses of the original one. For this correspondence, we make the reasonable assumption that for every logical structure there is a syntactically valid replacement by a unique literal. This assumption holds for FOL and DL. Following Definition 3, we use the newly introduced literals as replacements $R_i$ for the logical sub-structures $L_i$ of the original axiom in KB. Let every axiom Ax of the decomposition be identified by M(Ax), and let $DECOMP(KB) = \bigcup_{Ax \in KB} DECOMP(Ax)$ be the decomposition of the knowledge base KB.

Proposition 1 Provided the knowledge representation language allows for every logical structure a syntactically valid replacement by a literal, $D_{AX} = \{L_1, \ldots, L_n\}$ is an AX-Diagnosis for the KB-Diagnosis Problem (KB, B, TC+, TC−) iff $D_{KB} = \{M(L_1), \ldots, M(L_n)\}$ is a KB-Diagnosis for the KB-Diagnosis Problem (DECOMP(KB), B, TC+, TC−).

Note that the inverse transformation of the decomposition can be obtained by backsubstituting the equivalent expressions for each newly introduced literal. For our example, the original axiom arises by substituting the expressions for X66.1 (which is C13) and X66.2 (which is (AND X66.3 X66.6 . . .)), and recursively substituting the equivalent expressions for all X-literals therein. By this backsubstitution, we are able to pinpoint the logical structures in the axioms which correspond to the elements of AX-Diagnoses. Furthermore, it is not necessary to decompose the whole knowledge base; the decomposition can be applied on demand, e.g. if the knowledge engineer is interested in investigating an axiom more deeply because a single leading diagnosis was identified on the level of axioms.
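Backsubstitution can be sketched as a recursive expansion of the X-literals. The Python fragment below assumes the decomposed axioms are stored as a dictionary from markers to nested tuples with a single `EQUIVALENT` operator for all definitions; this encoding and the name `backsubstitute` are illustrative assumptions, and the example is a shortened axiom rather than the full Axiom 66.

```python
decomposed = {
    "66.0": ("DEFINE-CONCEPT", "X66.1", "X66.2"),
    "66.1": ("EQUIVALENT", "X66.1", "C13"),
    "66.2": ("EQUIVALENT", "X66.2", ("AND", "X66.3", "X66.6")),
    "66.3": ("EQUIVALENT", "X66.3", ("SOME", "X66.4", "X66.5")),
    "66.4": ("EQUIVALENT", "X66.4", "R22"),
    "66.5": ("EQUIVALENT", "X66.5", "C63"),
    "66.6": ("EQUIVALENT", "X66.6", ("AT-MOST", 3, "X66.7")),
    "66.7": ("EQUIVALENT", "X66.7", "R11"),
}

def backsubstitute(axioms, root):
    """Rebuild the original axiom by expanding every introduced X-literal."""
    definitions = {ax[1]: ax[2] for ax in axioms.values()
                   if ax[0] == "EQUIVALENT"}

    def expand(expr):
        if isinstance(expr, tuple):            # operator application
            return (expr[0], *(expand(a) for a in expr[1:]))
        if expr in definitions:                # introduced literal
            return expand(definitions[expr])
        return expr                            # original literal or number

    return expand(axioms[root])

original = backsubstitute(decomposed, "66.0")
print(original)
```

A KB-Diagnosis such as {66.3} on the decomposed axioms can then be read back as the SOME structure whose definition carries that marker.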

As a consequence, we can exploit all the methods for computing KB-Diagnoses as described in [Friedrich and Shchekotykhin, 2005] for the computation of axiom diagnoses. Based on Proposition 1, it follows easily that the complete and correct algorithm of [Friedrich and Shchekotykhin, 2005] for the generation of minimal KB-Diagnoses can be applied to soundly generate the set of all minimal AX-Diagnoses. The question of practical applicability remains and will be addressed in the next section.

5 Evaluation

We implemented the diagnostic engine in Java as described in [Friedrich and Shchekotykhin, 2005], with extensions for calculating axiom diagnoses. Benchmarks were run on a PC (Pentium IV with 2 GHz and 512 MB RAM) with Windows XP SP2 as operating system. We conducted numerous tests using the files from the RACER test suite.

In order to be comparable, we applied the same test setting as described in [Friedrich and Shchekotykhin, 2005], i.e. we conducted 30 tests for each knowledge base, where each test randomly changed the knowledge base such that each change on its own leads to an incoherent KB. The diagnosis task is to find minimum cardinality diagnoses in order to restore coherence. We employed QUICKXPLAIN [Junker, 2004] for finding minimal conflict sets, and RACER for coherence checks. Minimal conflict sets are computed on demand in order to label the hitting set tree³ (HS-Tree) exploited for the computation of minimal diagnoses.
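The interplay between on-demand conflict computation and the HS-Tree can be sketched as a breadth-first search in Python. The sketch simplifies Reiter's pruning rules to duplicate and superset checks, and `toy_find_conflict` is a hypothetical stand-in for the combination of QUICKXPLAIN and RACER; neither function is taken from the implemented engine.

```python
from collections import deque

CONFLICTS = [frozenset({66, 145}), frozenset({66, 146})]

def toy_find_conflict(axioms):
    """Stand-in for the conflict finder: a known minimal conflict or None."""
    for conflict in CONFLICTS:
        if conflict <= axioms:
            return conflict
    return None

def hs_tree_diagnoses(kb, find_conflict):
    """Breadth-first HS-Tree; node labels are computed only when needed."""
    diagnoses, seen = [], set()
    queue = deque([frozenset()])              # the root has an empty path
    while queue:
        path = queue.popleft()
        if path in seen or any(d <= path for d in diagnoses):
            continue                          # duplicate or non-minimal node
        seen.add(path)
        conflict = find_conflict(kb - path)   # label computed on demand
        if conflict is None:
            diagnoses.append(path)            # kb − path is coherent
        else:
            for axiom in sorted(conflict):    # one successor per element
                queue.append(path | {axiom})
    return diagnoses

kb = frozenset({1, 66, 145, 146})
result = hs_tree_diagnoses(kb, toy_find_conflict)
print([sorted(d) for d in result])  # [[66], [145, 146]]
```

Because the tree is expanded level by level, the search can be stopped after the first level that yields a diagnosis when only minimum cardinality diagnoses are required.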

The evaluation was carried out in two steps, the first of which determined the minimum cardinality KB-Diagnoses for a test. In the second step, we emulated the behavior of a knowledge engineer who wants to investigate an axiom Ax of a KB-Diagnosis D more deeply, e.g. to question which parts of Ax must be changed provided that D is the preferred diagnosis. Therefore, we selected an axiom Ax from a KB-Diagnosis found in the first step. Since we are only interested in AX-Diagnoses w.r.t. Ax, this question corresponds to computing AX-Diagnoses where the background theory is extended by KB − D, i.e. $B_e := B \cup (KB - D)$, and the knowledge base for which we are calculating AX-diagnoses is DECOMP(Ax).

In case we want to compute the minimal AX-Diagnoses of an axiom contained in a KB-Diagnosis, the following heuristic allows a faster computation by reusing minimal conflict sets found in the KB-Diagnosis step. Informally, a conflict set CS of a knowledge base KB is a subset of KB which is inconsistent with at least one positive test case unified with the background theory, or just with the background theory in case there are no positive test cases. Such a set is called minimal if no proper subset of CS is a conflict set. Let D be a KB-diagnosis and let $CS_{KB} = \{CS_1, \ldots, CS_k\}$ be the minimal conflict sets found in the KB-Diagnosis step which contain Ax. It follows that $(CS_i - D) \cup DECOMP(Ax)$ must contain a minimal conflict set. Therefore, when computing a label for the HS-Tree, in a first step we use $B'_e := B \cup [\bigcup_{i=1}^{k} CS_i - D]$ as a reduced background theory. If we cannot find a minimal conflict w.r.t. this reduced background theory, in a second step we use the full background theory $B_e$ to generate a conflict set. Note that if for a node n in the HS-Tree we cannot find a conflict w.r.t. the reduced background theory $B'_e$, then this property also holds for all successors of n. Hence, we can omit the first step for these successor nodes.

Table 1 depicts the results using this heuristic for calculating diagnoses and compares the results to the plain method without heuristic speedup. Columns are to be interpreted as follows: H-ADT and ADT are the times for generating the minimum cardinality AX-Diagnoses with heuristic speed-up (H-ADT) and without heuristic speed-up (ADT). KB-DT is the time for running the initial knowledge base diagnosis. All times are given in seconds and include the time for decompositions. For each KB, we characterized the fastest (min) and slowest (max) heuristic axiom diagnosis by the number of minimum cardinality AX-Diagnoses (#AD) and the number of minimal conflicts (#AC) generated during diagnosing the axiom. As the heuristic requires the conflict sets within the full KB, we additionally provide the number of KB-conflicts (#KBC) and their maximum size (|KBC|) for the fastest and slowest case, as well as the number of invalid minimum cardinality AX-Diagnoses (i.e. those AX-Diagnoses of the reduced background theory $B'_e$ being inconsistent with the full one $B_e$, shown in column #IAD) and the time for finding the minimum cardinality AX-Diagnoses w.r.t. the reduced background $B'_e$ (DTRB). The “avg” row shows these values averaged over all tests for a single KB. The unit “ax” denotes the number of axioms in a KB.

³ We assume the reader to know the basics of model-based diagnosis [Reiter, 1987].

KB       ax    row   #KBC   |KBC|   #AD    #AC    #IAD   DTRB    H-ADT    ADT      KB-DT
bike1    81    min   4      4       5      1      0      0,66    2,15     8,56     6,17
               avg   4,07   4,27    5,43   1,57   0      1,92    3,71     34,02    27,58
               max   5      4       15     2      0      3,48    8,18     87,25    59,77
bike2    154   min   6      4       4      1      0      1,02    3,63     8,14     21,69
               avg   5,67   4,1     3,13   1,43   0,1    1,86    5,4      17,43    52,84
               max   3      4       3      1      1      1,41    18,67    31,89    100,85
bike3    109   min   4      3       5      1      0      0,77    4,01     7,31     24,34
               avg   3,97   3       5      1      0      0,86    4,66     53,16    33,02
               max   5      4       3      1      2      1,18    23,04    426,19   157,99
bike5    184   min   3      4       3      2      0      3,25    6,61     36,03    30,98
               avg   3,6    3,97    6,2    1,33   0      1,83    12,85    101,84   114,05
               max   2      3       6      1      3      1,72    64,7     493,52   217,1
bike7    162   min   1      3       3      1      0      0,49    3,64     30,25    24,43
               avg   2,63   3       5,27   1,23   0      1,76    11,89    59,56    63,05
               max   3      3       9      1      0      5,47    27,18    97,3     113,6
bike8    185   min   1      3       3      1      0      0,48    3,92     33,03    30,99
               avg   2,57   3       5,33   1,2    0      1,28    11,83    52,15    65,65
               max   4      4       6      1      3      2,22    53,94    536,98   234,47
galen    3963  min   4      2       3      2      0      12,98   52,27    124,78   142,03
               avg   3,47   2       5,4    1,2    2,07   11,48   146,98   252,71   244,52
               max   2      2       9      2      15     26,17   507,33   523,55   525,66
galen2   3927  min   3      2       5      1      0      5,95    39,2     98,58    90,3
               avg   3,03   2       5,2    1,3    2,5    12,7    94,61    146,27   141,83
               max   2      2       9      2      15     23,94   237,21   236,79   287,45
bcs3     432   min   2      3       1      3      0      1,61    3,16     5,86     5,36
               avg   3,33   25,63   2,03   1,67   0,87   1,68    24,15    154,45   126,73
               max   3      2       2      1      2      0,72    99,64    550,39   528,54

Table 1: Test results for diagnosing faulty axioms.

[Figure 3: bar chart of time [s] (vertical axis, ticks from 50 to 250) per knowledge base (horizontal axis, number of axioms in parentheses): bike1 (81), bike2 (154), bike3 (109), bike4 (166), bike5 (184), bike6 (207), bike7 (162), bike8 (185), bike9 (215), bcs3 (432), galen2 (3927), galen (3963).]

Figure 3: Comparing the performance with and without heuristic diagnosis according to the size of the KB. The bars in the back show the average times (in seconds) for finding minimum cardinality AX-Diagnoses without heuristic improvements. The black bars in the middle display the average time required when the heuristic is exploited. The small bars in the front show the average time required for finding minimum cardinality AX-Diagnoses with reduced background (i.e. before checking consistency w.r.t. B ∪ (KB − D)).

In addition, we improved the application of QUICKXPLAIN. QUICKXPLAIN takes two arguments (a KB and a background theory B) and computes a minimal subset of KB (called a minimal conflict set) which is inconsistent with B, provided that B is consistent. If such a set does not exist (i.e. KB ∪ B is consistent), it outputs “consistent”. Experiments showed that QUICKXPLAIN performs better if B is small, due to the divide-and-conquer technique, which reduces KB rapidly (but not B) and therefore reduces the costs for consistency checking. A reduction of the size of KB results in a disproportionate speed-up. Note that this is also the reason why QUICKXPLAIN works well for large knowledge bases. Consequently, if we diagnose DECOMP(Ax) w.r.t. (a possibly large) $B_e$, a divide-and-conquer strategy will not provide significant acceleration. Hence, we adopt another strategy for generating minimal conflicts by leaving B untouched and just replacing the axiom in D we want to investigate by DECOMP(Ax).
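The divide-and-conquer behaviour described here can be seen in a compact Python sketch of the recursion published in [Junker, 2004], without preference handling. It assumes that B is consistent, and the consistency check `toy_consistent` is a hypothetical stand-in for RACER; the sketch is not the implementation used in the experiments.

```python
CONFLICTS = [{66, 145}, {66, 146}]

def toy_consistent(axioms):
    """Stand-in for the reasoner: consistent iff no known conflict is included."""
    return not any(conflict <= axioms for conflict in CONFLICTS)

def quickxplain(kb, background, is_consistent):
    """Return a minimal conflict within kb w.r.t. background, or None."""
    kb = list(kb)
    if is_consistent(set(background) | set(kb)):
        return None                           # the "consistent" output
    if not kb:
        return set()

    def qx(b, delta, c):
        # b: current background, delta: axioms added last, c: candidates
        if delta and not is_consistent(b):
            return set()                      # nothing from c is needed
        if len(c) == 1:
            return set(c)
        half = len(c) // 2
        c1, c2 = c[:half], c[half:]
        d2 = qx(b | set(c1), c1, c2)          # part of the conflict inside c2
        d1 = qx(b | d2, d2, c1)               # part of the conflict inside c1
        return d1 | d2

    return qx(set(background), [], kb)

conflict = quickxplain([1, 66, 145, 2, 146], set(), toy_consistent)
print(sorted(conflict))  # [66, 145]
```

Only the candidate list `c` is split, while `b` is passed to every check in full, which is why a large background theory limits the benefit of the divide-and-conquer scheme.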

The effort for finding an AX-diagnosis by running a KB-diagnosis on (KB − D) ∪ DECOMP(Ax) with background B and no heuristic speedup is approximately the same as for calculating a diagnosis for the plain knowledge base (prior to the axiom diagnosis). The additional cost if the heuristic fails (no diagnosis found by using the reduced background theory) is negligible compared to the time spent for using the extended background theory $B_e$. The heuristic for generating minimum cardinality AX-diagnoses can be summarized as follows:

1. Let a KB-diagnosis D be given for KB and background B (and possibly existing test cases, which we omit for brevity). Let Ax ∈ D be the axiom to be investigated.

2. Collect all conflicts $CS_1, \ldots, CS_k$ from the HS-tree from which D originates, and create $B'_e$ by adding all $CS_i - D$ to B.

3. Run a diagnosis on DECOMP(Ax) with background $B'_e$.

4. Check the resulting diagnoses for consistency w.r.t. $B \cup (KB - D)$.⁴ If none of them turns out to be consistent, run a diagnosis on (KB − D) ∪ DECOMP(Ax) with background B.

For 310 experiments, we compared the times for generating minimum cardinality AX-Diagnoses with and without the heuristic (see Figure 3). As can be seen, the heuristic always achieved a considerable improvement. In particular, the time for diagnosing w.r.t. the reduced background theory is almost negligible compared to the diagnosis time for the full background theory. The expensive step is (as expected) to check whether diagnoses of the reduced background theory are consistent with the full background theory. Therefore, the speed-up depends on the number of diagnoses which must be checked and the cost of consistency checking. Note that in our experiments the minimum cardinality diagnoses of the reduced background theory always contained those of the full background theory. The number of diagnoses deleted in the maximum average cases is rather small (around two, see Table 1), which means that the diagnoses of the reduced problem are already a very good approximation for the minimum cardinality diagnoses of the full problem.

The running time for finding an axiom diagnosis depends strongly on the complexity of the axiom, as well as on the size and structure of the knowledge base. Note that in our evaluation, RACER does not provide any information on which axioms are used for discovering an incoherence. If theorem provers are available which return the set of axioms applied during theorem proving (i.e. they return a conflict set which is not necessarily minimal, but smaller than the whole KB), then QUICKXPLAIN could start with such a reduced set. This will result in significant speed-ups.

6 Related work

Most closely related to our approach is the work of [Schlobach and Cornet, 2003]. In this paper, a method for debugging faulty knowledge bases as well as faulty axioms is proposed. The approach is called concept pinpointing. Concepts are diagnosed by successive generalization of an axiom until the most general form that is still incoherent is achieved. Generalizations are based on a syntactic relation, which is assumed to exist and left to the knowledge engineer. Moreover, the approach will find concepts only.

In [Schlobach, 2005], three approaches are compared for generating diagnoses of terminologies, which differ in the size of the conflict sets returned by the theorem prover. The first approach always considers the complete knowledge base (which served as an input for the theorem prover) as a conflict set. In the second approach, the theorem prover returns a minimized (not necessarily minimal) conflict set, and in the third approach, a special procedure for computing minimal conflicts in unfoldable (i.e. acyclic) ALC-Tboxes is employed. Compared to these evaluations, we employ QUICKXPLAIN to generate minimal conflict sets in order to avoid the known problems of non-minimal conflicts for the HS-tree generation. Since we are not restricted to a special theorem prover, our approach is not restricted to unfoldable ALC-Tboxes. However, a runtime comparison of generating minimal conflict sets by the method described in [Schlobach and Cornet, 2003] and the combination of QUICKXPLAIN with a highly optimized consistency checker is open. Note that this comparison also strongly depends on the ability of the consistency checker to return axioms which were involved in the generation of an inconsistency, as described above.

⁴ Prior to invoking RACER for checking coherence, one can search for known conflicts to appear in the set of axioms, and calculate a new conflict if nothing is found. This conflict can be re-used in subsequent checks.

In [Mateis et al., 2000], model-based diagnosis of Java programs is discussed. Similar to our approach, the grammar of the Java language is the starting point of the transformation, but unlike our method, the approach is designed to be used with imperative semantics, while we focus on declarative semantics.

In the heuristic approaches of [Parsion et al., 2005] and [Wang et al., 2005], debugging cues and error patterns, respectively, are exploited. In contrast to these approaches, our goal was to provide a general, complete, and correct method for diagnosis. Nevertheless, these heuristic approaches may help us to discriminate between minimal diagnoses. Based on a connection relation, [Haase et al., 2005] present a method for computing a minimal subset of an ontology in which a concept is unsatisfiable. However, they do not identify the parts of axioms which cause this unsatisfiability.

7 Conclusions

We presented a general theory of diagnosis for faulty knowledge bases, which allows the identification of faulty parts of axioms. This approach subsumes current methods and is independent of FOL knowledge representation language variants or particular reasoning systems. Based on the roots of model-based diagnosis, we were able to develop correct and complete algorithms for the computation of axiom diagnoses. We have shown the feasibility of our approach by extensive test evaluation, and provided an extension of current diagnosis methods such that a considerable speed-up for the diagnosis of axioms is achieved.

Acknowledgments

The research project is funded partly by grants from the Austrian Research Promotion Agency, Programm Line FIT-IT Semantic Systems (www.fit-it.at), Project AllRight, Contract 809261, and the European Union, Project WS-Diamond, Contract 516933.

References

[Baader et al., 2003] F. Baader, D. Calvanese, D.L. McGuinness, D. Nardi, and P.F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003.

[Friedrich and Shchekotykhin, 2005] G. Friedrich and K. Shchekotykhin. A general diagnosis method for ontologies. In 4th ISWC 2005, volume 3729. Springer LNCS, 2005.

[Haase et al., 2005] P. Haase, F. van Harmelen, Z. Huang, H. Stuckenschmidt, and Y. Sure. A framework for handling inconsistency in changing ontologies. In 4th ISWC 2005, volume 3729. Springer LNCS, 2005.

[Junker, 2004] U. Junker. QUICKXPLAIN: Preferred explanations and relaxations for over-constrained problems. In Proc. AAAI 04, pages 167–172, San Jose, CA, USA, 2004.

[Linz, 1996] P. Linz. An introduction to formal languages and automata. Lexington Books, Mass., 2nd edition, 1996.

[Mateis et al., 2000] C. Mateis, M. Stumptner, and F. Wotawa. Modeling Java programs for diagnosis. In ECAI 2000, pages 171–175, 2000.

[Parsion et al., 2005] B. Parsion, E. Sirin, and A. Kalyanpur. Debugging OWL ontologies. In WWW 2005, Chiba, Japan, May 2005. ACM.

[Patel-Schneider and Swartout, 1993] P.F. Patel-Schneider and B. Swartout. Description-logic knowledge representation system specification. Technical report, KRSS Group of the DARPA Knowledge Sharing Effort, November 1993.

[Reiter, 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57–95, 1987.

[Schlobach and Cornet, 2003] S. Schlobach and R. Cornet. Non-standard reasoning services for the debugging of description logic terminologies. In Proc. IJCAI 03, pages 355–362, Acapulco, Mexico, 2003.

[Schlobach, 2005] S. Schlobach. Diagnosing terminologies. In Proc. AAAI 05, pages 670–675, Pittsburgh, PA, USA, 2005.

[Wang et al., 2005] H. Wang, M. Horridge, A. Rector, N. Drummond, and J. Seidenberg. Debugging OWL-DL ontologies: A heuristic approach. In 4th ISWC 2005, volume 3729. Springer LNCS, 2005.


Robust Fault Detection with State Estimators and Interval Models usingZonotopes

Pedro Guerra, Vicenc Puig, Ari Ingimundarson

Automatic Control Department, Universitat Politecnica de Catalunya (UPC)

Rambla de Sant Nebridi, 11, 08222 Terrassa (Spain)

[email protected]

Abstract

In this paper, the problem of robust fault detection considering process/measurement noises and modeling uncertainties is addressed with two different state estimation strategies based on a zonotope representation of the state space. First, a consistency-based approach based on propagating the uncertainty with zonotopes is proposed. Second, a worst-case state estimation strategy based on adaptive thresholds using zonotopes is presented. In both strategies, the modeling uncertainty is represented by bounding model parameters in intervals. Process and measurement noise are also considered unknown but bounded. Finally, an example based on a linearised model of a flight control system is proposed to compare both approaches.

1 Introduction

Model-based fault detection techniques have been investigated and developed within the DX and FDI communities over the last few years. Model-based fault detection is based on the use of mathematical models of the monitored system. The better the model represents the dynamic behavior of the system, the better the chance of improving the reliability and performance in detecting faults. However, modeling errors and disturbances in complex engineering systems are inevitable, and hence there is a need to develop robust fault detection algorithms.

The most common approach to deal with the robustness problem in the FDI community is based on the decoupling principle, in which the residual is designed to be insensitive to unknown disturbances whilst remaining sensitive to faults, using the unknown input observer, eigenstructure assignment [Chen and Patton, 1999] or structured parity equations [Gertler, 1998]. Using one of these approaches, the robustness problem with respect to unknown disturbances is solved. However, the robustness problem with respect to modeling errors is more difficult to solve: the distribution matrix of the modeling errors is normally unknown and must be estimated, it is often time varying, and there may be too many disturbances to decouple with the available degrees of freedom. An alternative to treating modeling errors as disturbances and decoupling their effect is to propagate and bound their effect on the residual using, for example, interval methods [Puig et al., 2002b]. This is the approach followed in this paper to handle modeling uncertainties.

On the other hand, process and measurement noises are usually modeled stochastically using restrictive assumptions concerning the distribution law (the typical assumption is a zero-mean white noise). However, in many practical situations it is more natural to assume that only bounds on the noise signals are available [Milanese et al., 1996]. This is also the approach used in this paper to describe noise signals. Unfortunately, the set of states obtained by propagating parameter and noise uncertainty may become extremely complex. For this reason, several approximating sets that enclose the set of possible states have been proposed in the literature. In [Witczak et al., 2002], a state estimator based on enclosing the set of states by the smallest ellipsoid is proposed, following the algorithms of [Maksarov and Norton, 1996]. However, that approach considers only additive uncertainty, not the multiplicative uncertainty introduced by modeling uncertainty located in the parameters. Here, both types of uncertainty are considered, as in [Rinner and Weiss, 2004]; but there, only the system trajectories obtained from the vertices of the uncertain parameter intervals are considered, assuming that the monotonicity property holds. In this paper, two state estimation approaches based on enclosing the set of states by zonotopes are presented, without assuming any monotonicity property and considering the whole set of possible trajectories. The first is the consistency-based approach, based on determining the set of states that are consistent with parameter and measurement uncertainty. The second is a worst-case state estimation, based on bounding the set of possible states by computing the worst-case state estimation due to parameter and measurement uncertainty.

The paper is organized as follows. In Section 2, the consistency-based and worst-case state estimation principles are introduced. Section 3 presents the implementation of both approaches using zonotopes. In Section 4, these approaches are applied to fault detection. Finally, in Section 5, an application example based on a linearised model of a flight control system is presented to compare both strategies.


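2 Consistency-based and worst-case state estimation principles

2.1 System set-up

Let us consider the following discrete-time linear system:

$$
\begin{aligned}
x_{k+1} &= (A + \Delta A_k)x_k + (B + \Delta B_k)u_k + w_k \\
y_k &= Cx_k + v_k
\end{aligned}
\qquad (1)
$$

where:

• $x_k \in \mathbb{R}^{n_x}$, $u_k \in \mathbb{R}^{n_u}$ and $y_k \in \mathbb{R}^{n_y}$ are the state, input and output vectors of dimension $n_x$, $n_u$ and $n_y$ respectively;

• $v_k \in \mathbb{R}^{n_v}$ and $w_k \in \mathbb{R}^{n_w}$ are the measurement and process noise of dimension $n_v$ and $n_w$ respectively, which are considered unknown but bounded, i.e. $v_k \in V_k$ and $w_k \in W_k$, where $V_k$ and $W_k$ are interval boxes: $V_k = \{v_k \in \mathbb{R}^{n_v} \mid \underline{v}_k \le v_k \le \overline{v}_k\}$, $W_k = \{w_k \in \mathbb{R}^{n_w} \mid \underline{w}_k \le w_k \le \overline{w}_k\}$;

• $A$, $B$ and $C$ are the state space matrices, and $\Delta A$ and $\Delta B$ represent the associated modelling errors.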

If the modelling errors $\Delta A$ and $\Delta B$ are located in the parameters, a vector of uncertain time-varying parameters $\theta_k$ of dimension $p$ is introduced, with values bounded by a compact set $\theta_k \in \Theta$ of box type, i.e. $\Theta = \{\theta \in \mathbb{R}^p \mid \underline{\theta} \le \theta \le \overline{\theta}\}$. This type of model is known as an interval model. The system matrices, including their associated uncertain parameters, are then written as $A(\theta_k)$ and $B(\theta_k)$ respectively.

2.2 The consistency-based estimation principle

A consistency-based state estimator assumes a priori bounds on noise and uncertain parameters and constructs sets of estimated states that are consistent with the a priori bounds and the current measurements. Several researchers, such as [Chisci et al., 1996], [Maksarov and Norton, 1996], [Shamma, 1997], [Calafiore, 2001] and [Kieffer et al., 2002], among others, have addressed this issue.

Definition 1. Consider a system given by Eq. (1), an initial compact set $X_0$ and a sequence of measured inputs $(u_j)_0^{k-1}$ and outputs $(y_j)_0^{k}$. The exact uncertain state set at time $k$ using the set-membership approach is expressed by

$$X_k = \left\{ x_k : \left(x_j = A(\theta_{j-1})x_{j-1} + B(\theta_{j-1})u_{j-1} + w_{j-1}\right)_{j=1}^{k},\ \left(y_j = Cx_j + v_j\right)_{j=1}^{k} \right\}$$

The uncertain state set described in Definition 1 at time $k$ can be computed approximately by breaking the relations that exist between variables of consecutive time instants. This makes it possible to compute an approximation of this set from the approximate uncertain state set at time $k-1$. In the linear case, and considering only additive uncertainty, the set of uncertain states generally takes the form of (convex) polytopes, and efficient algorithms exist in the literature to deal with them [Chisci et al., 1996], [Maksarov and Norton, 1996]. But in the non-linear case (or in the linear case with multiplicative uncertainty), an explicit construction of the set of possible states is essentially prevented by the generality of shapes [Kieffer et al., 2002]. Using set computations, it is possible to define an algorithm for the non-linear case that constructs an approximation of the set of uncertain states, $X_k$, which are consistent with the current measurement trajectory and with the bounds on noise, disturbances and parameter uncertainty. Before introducing this algorithm, two additional definitions are introduced.

Definition 2. Consider a system given by Eq. (1), the set of uncertain states at time $k-1$, $X_{k-1}$, and the input/output pair $(u_{k-1}, y_{k-1})$. Then, the set of predicted states at time $k$ based on the measurements up to time $k-1$ is defined as

$$X^e_k = \left\{ x_k : x_k = A(\theta_{k-1})x_{k-1} + B(\theta_{k-1})u_{k-1} + w_{k-1} \mid x_{k-1} \in X_{k-1},\ \theta_{k-1} \in \Theta,\ w_{k-1} \in W_{k-1} \right\}$$

Definition 3. Consider a system given by Eq. (1) and a measured output $y_k$. Then, the set of states at time $k$ consistent with this measurement is defined as

$$X^{y_k}_k = \left\{ x_k : y_k = Cx_k + v_k,\ \theta_k \in \Theta,\ v_k \in V_k \right\}$$

Then, the following algorithm can be introduced to determine an approximation of the set of uncertain states:

Algorithm 1. "Consistency-based State Estimator using Set-computations"

Considering a system given by Eq. (1), an initial compact set $X_0$ and a sequence of measured inputs $(u_j)_0^{k-1}$ and outputs $(y_j)_0^{k}$, at each sample time $k$:

Step 1: Compute the set of predicted states, $X^e_k$

Step 2: Compute the set of consistent states, $X^{y_k}_k$

Step 3: Compute the set of uncertain states as $X_k = X^e_k \cap X^{y_k}_k$
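As an illustration only, the three steps of a consistency-based estimator can be sketched for a scalar system using plain intervals instead of zonotopes. The function names (`predict`, `consistent`, `intersect`, `step`) and all numerical values are hypothetical and do not come from the paper; the sketch assumes a scalar state, interval-valued parameters `a` and `b`, and a nonzero output gain `c`.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def imul(p: Interval, q: Interval) -> Interval:
    """Interval product: the extremes are reached at endpoint products."""
    prods = [p[0] * q[0], p[0] * q[1], p[1] * q[0], p[1] * q[1]]
    return (min(prods), max(prods))


def iadd(p: Interval, q: Interval) -> Interval:
    """Interval sum."""
    return (p[0] + q[0], p[1] + q[1])


def predict(x: Interval, a: Interval, b: Interval, u: float, w: Interval) -> Interval:
    """Step 1: enclosure of a*x + b*u + w over all admissible a, b, x, w."""
    return iadd(iadd(imul(a, x), imul(b, (u, u))), w)


def consistent(y: float, c: float, v: Interval) -> Interval:
    """Step 2: states x such that y = c*x + v for some v in V (c != 0 assumed)."""
    lo, hi = (y - v[1]) / c, (y - v[0]) / c
    return (min(lo, hi), max(lo, hi))


def intersect(p: Interval, q: Interval) -> Optional[Interval]:
    """Step 3: intersection of two intervals, or None if it is empty."""
    lo, hi = max(p[0], q[0]), min(p[1], q[1])
    return (lo, hi) if lo <= hi else None


def step(x, a, b, u, w, y, c, v) -> Optional[Interval]:
    """One iteration: predict, build the consistent set, intersect."""
    return intersect(predict(x, a, b, u, w), consistent(y, c, v))
```

An empty intersection (`None`) means that no admissible state explains the measurement under the assumed bounds.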

Except for very particular cases, it is not possible to evaluate exactly the three sets $X^e_k$, $X^{y_k}_k$ and $X_k$ required in Algorithm 1. Instead, guaranteed outer approximations of these sets, as accurate as possible, have been used in the literature. In the case of non-linear systems (or systems including multiplicative uncertainty), such outer approximations are based on subpavings [Kieffer et al., 2002], ellipsoids [ElGhaoui and Calafiore, 1997], [Polyak et al., 2004], or zonotopes [Alamo et al., 2005], among others.

2.3 The worst-case state estimation principle

Let the model for the state estimator of the system described by Eq. (1) be a Luenberger observer formulated as

$$
\begin{aligned}
\hat{x}_{k+1} &= A(\theta)\hat{x}_k + B(\theta)u_k + w_k + K(y_k - \hat{y}_k) \\
\hat{y}_k &= C(\theta)\hat{x}_k + v_k
\end{aligned}
\qquad (2)
$$

where $K$ is the observer gain. When $K = 0$ the state observer becomes a simulator, while when $K = A$ it becomes a predictor.

Definition 4. Consider the state estimator given by Eq. (2), an initial compact set $X_0$ and a sequence of measured inputs $(u_j)_0^{k-1}$ and outputs $(y_j)_0^{k}$. The exact uncertain state set at time $k$ using the worst-case approach is expressed by

$$X_k = \left\{ \hat{x}_k : \left(\hat{x}_j = A(\theta_{j-1})\hat{x}_{j-1} + B(\theta_{j-1})u_{j-1} + w_{j-1} + K(y_{j-1} - \hat{y}_{j-1}),\ \hat{y}_{j-1} = C\hat{x}_{j-1} + v_{j-1}\right)_{j=1}^{k} \right\}$$


As in the case of the uncertain state set described in Definition 1, the uncertain state set described in Definition 4 at time $k$ can be computed approximately by breaking the relations that exist between variables of consecutive time instants. This makes it possible to compute an approximation of this set from the approximate uncertain state set at time $k-1$. Because the exact set of estimated states would be difficult to compute, one straightforward way to bound this set is to use a box (interval hull) [Puig et al., 2002a], a zonotope [Puig et al., 2001] or other geometric regions that are easy to compute [Puig et al., 2005]. In this paper, the set of estimated states is computed iteratively using zonotopes. From these zonotopes, a worst-case estimation of each state variable can be obtained by computing the interval hull of the zonotope. The sequence of interval hulls $\Box X_k$ with $k \in [0, N]$ will be called the worst-case estimation of the system given by Eq. (1).

Then, the following algorithm can be introduced to determine an approximation of the set of uncertain states:

Algorithm 2. "Worst-case State Estimator using Set-computations"

Considering the state estimator given by Eq. (2), an initial compact set $X_0$ and a sequence of measured inputs $(u_j)_0^{k-1}$ and outputs $(y_j)_0^{k}$, at each sample time $k$:

Step 1: Compute the set of uncertain states, $X_k$

Step 2: Compute the interval hull of the set of uncertain states:

$$\Box X_k = [\underline{x}(k), \overline{x}(k)] \qquad (3)$$

2.4 Comparison between both approaches

The main difference between the two state estimation approaches presented in the previous sections is how measurements are taken into account. In the consistency-based approach, the effect of measurements is taken into account implicitly, through the intersection of the set of states consistent with the measurements and the set of states predicted using the model. The degree of correction depends on the relative values of process and measurement noise. If the level of measurement noise is very low, the set of states is almost reduced to the set of states consistent with the measurements. In the worst-case approach, on the other hand, the effect of measurements on the correction of the state estimation is considered explicitly through the selection of the gain $K$. If the process and measurement noise are modeled stochastically (Kalman filter), the value of the observer gain $K$ also depends on the relative values of process and measurement noise, as in the consistency-based approach, which makes a comparison between both approaches possible.

3 Implementation using zonotopes

3.1 Introduction

In this paper, zonotopes are used to bound the sets of uncertain estimated states. Let us introduce zonotopes.

Definition 5. The Minkowski sum of two sets $X$ and $Y$ is defined by $X \oplus Y = \{x + y : x \in X,\ y \in Y\}$.

Definition 6. Given a vector $p \in \mathbb{R}^n$ and a matrix $H \in \mathbb{R}^{n \times m}$, the Minkowski sum of the segments defined by the columns of matrix $H$ is called a zonotope of order $m$. This set is represented as:

$$X = p \oplus HB^m = \{p + Hz : z \in B^m\}$$

where $B^m$ is a unitary box, composed of $m$ unitary intervals.

Then, a zonotope $X$ of order $m$ can be viewed as the Minkowski sum of $m$ segments (Figure 1). The order $m$ is a measure of the geometrical complexity of the zonotope.

Figure 1: Zonotope of order m=14

Definition 7. The interval hull $\Box X$ of a closed set $X$ is the smallest interval box that contains $X$.

Given a zonotope $X = p \oplus HB^m$, its interval hull can be easily computed by evaluating $p \oplus HB^m$ using interval arithmetic, since:

$$\Box X = \{x : |x_i - p_i| \le \|H_i\|_1\} \qquad (4)$$

where $H_i$ is the $i$th row of $H$, and $x_i$ and $p_i$ are the $i$th components of $x$ and $p$, respectively.
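The interval hull of a zonotope reduces to a row-wise 1-norm of the generator matrix. A minimal sketch, assuming the zonotope is stored as a centre list `p` and a generator matrix `H` given as a list of rows (this storage layout and the function name `interval_hull` are illustrative choices, not the paper's):

```python
from typing import List, Tuple


def interval_hull(p: List[float], H: List[List[float]]) -> List[Tuple[float, float]]:
    """Interval hull of the zonotope p + H*B^m.

    Component i of the hull is [p_i - r_i, p_i + r_i], where r_i is the
    1-norm (sum of absolute values) of row i of H.
    """
    radii = [sum(abs(h) for h in row) for row in H]
    return [(pi - ri, pi + ri) for pi, ri in zip(p, radii)]
```

For example, with `p = [1.0, -2.0]` and `H = [[1.0, 0.5, -0.5], [0.0, 2.0, 1.0]]` the row radii are 2.0 and 3.0, so the hull is the box `[(-1.0, 3.0), (-5.0, 1.0)]`.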

3.2 Implementation of consistency-based estimators

According to Algorithm 1, consistency-based state estimation involves three bounding operations, applied to the set of predicted states $X^e_k$, the consistent state set $X^{y_k}_k$ and their intersection $X_k$.

A. Implementation of prediction set step

The prediction set step requires characterizing the set $X^e_k$. This set can be viewed as the direct image evaluation of $f(x_k, \theta_k, w_k) = A(\theta_k)x_k + B(\theta_k)u_k + w_k$. There are different algorithms to bound such an image using subpavings [Kieffer et al., 2002], ellipsoids [Polyak et al., 2004] or zonotopes [Kuhn, 1998]. To bound such an image using zonotopes, the following result is used:

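Theorem 2. "Zonotope Inclusion" [Alamo et al., 2005]. Consider a family of zonotopes represented by $X = p \oplus MB^m$, where $p \in \mathbb{R}^n$ is a real vector and $M \in \mathbb{I}^{n \times m}$ is an interval matrix. A zonotope inclusion $\diamond(X)$ is defined by:

$$\diamond(X) = p \oplus \begin{bmatrix} \mathrm{mid}(M) & G \end{bmatrix} \begin{bmatrix} B^m \\ B^n \end{bmatrix} = p \oplus JB^{n+m}$$

where $G \in \mathbb{R}^{n \times n}$ is a diagonal matrix that satisfies

$$G_{ii} = \sum_{j=1}^{m} \frac{\mathrm{diam}(M_{ij})}{2}, \quad i = 1, 2, \ldots, n,$$

with "mid" denoting the center and "diam" the diameter of the interval [Moore, 1966]. Under this definition, $X \subseteq \diamond(X)$.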
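The zonotope inclusion operator can be sketched as follows. This is an illustrative implementation under stated assumptions: the interval matrix is passed as two real matrices `M_lo` and `M_hi` holding the element-wise lower and upper bounds (a representation chosen here, not prescribed by the paper), and the function name `zonotope_inclusion` is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def zonotope_inclusion(M_lo: Matrix, M_hi: Matrix) -> Matrix:
    """Return J = [mid(M) | G] for the interval matrix M = [M_lo, M_hi].

    mid(M) holds the interval midpoints. G is diagonal, and G[i][i] is the
    sum over row i of the interval radii diam(M[i][j]) / 2.
    """
    n = len(M_lo)
    J = []
    for i in range(n):
        mid_row = [(lo + hi) / 2.0 for lo, hi in zip(M_lo[i], M_hi[i])]
        g_ii = sum((hi - lo) / 2.0 for lo, hi in zip(M_lo[i], M_hi[i]))
        diag_row = [g_ii if j == i else 0.0 for j in range(n)]
        J.append(mid_row + diag_row)
    return J
```

The result has $n + m$ columns: the first $m$ are the nominal generators and the last $n$ absorb the interval uncertainty of each row, which is why every application of the operator raises the order of the zonotope.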

This prediction step aims at computing the zonotope $X^e_{k+1}$ that bounds the trajectory of the system at instant $k+1$ from the previous approximating zonotope at time instant $k$, $X_k$, using the interval mean-value extension of Eq. (1) [Moore, 1966] and the zonotope inclusion operator, as a generalization of Kuhn's method [Kuhn, 1998]:

$$X^e_{k+1} = p_{k+1} \oplus H_{k+1}B^r \qquad (5)$$

where:

$$p_{k+1} = \mathrm{mid}(A(\theta_k))p_k + \mathrm{mid}(B(\theta_k))u_k$$

and

$$H_{k+1} = \begin{bmatrix} J_1 & J_2 & J_3 & W \end{bmatrix}$$

$$J_1 = \diamond(A(\theta_k)H_k)$$

$$J_2 = \diamond\big((A(\theta_k) - \mathrm{mid}(A(\theta_k)))\,p_k\big)$$

$$J_3 = \frac{\mathrm{diam}(B(\theta_k))}{2}\,u_k$$

$W$ is the process noise. $J_1$ and $J_2$ are calculated using the zonotope inclusion operator.

It is important to notice that, with this method, the set of predicted states increases the number of segments generating the zonotope $X^e_{k+1}$. In order to control the domain complexity, a reduction step is therefore implemented. Here we use the method proposed in [Combastel, 2003] to reduce the zonotope complexity:

Property 1. Given the zonotope $X = p \oplus HB^r \subset \mathbb{R}^n$ and the integer $s$ (with $n < s < r$), denote by $\hat{H}$ the matrix resulting from reordering the columns of matrix $H$ in decreasing Euclidean norm. Then, $X \subseteq p \oplus \begin{bmatrix} \hat{H}_T & Q \end{bmatrix} B^s$, where $\hat{H}_T$ is obtained from the first $s - n$ columns of matrix $\hat{H}$ and $Q \in \mathbb{R}^{n \times n}$ is a diagonal matrix that satisfies:

$$Q_{ii} = \sum_{j=s-n+1}^{r} |\hat{H}_{ij}|, \quad i = 1, \ldots, n.$$
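The order-reduction property can be sketched in a few lines: sort the generators by decreasing Euclidean norm, keep the largest $s - n$, and replace the rest by an axis-aligned box. The function name `reduce_order` and the list-of-rows storage of the generator matrix are assumptions made for this illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def reduce_order(H: Matrix, s: int) -> Matrix:
    """Reduce an n x r generator matrix to n x s, with n < s < r.

    The s - n columns of largest Euclidean norm are kept; the remaining
    columns are over-bounded by a diagonal matrix Q whose entry Q[i][i]
    is the sum of their absolute values in row i.
    """
    n, r = len(H), len(H[0])
    if not n < s < r:
        raise ValueError("reduction requires n < s < r")
    # Work column-wise: each generator is one column of H.
    cols = [[H[i][j] for i in range(n)] for j in range(r)]
    cols.sort(key=lambda c: math.sqrt(sum(x * x for x in c)), reverse=True)
    kept, rest = cols[: s - n], cols[s - n:]
    q = [sum(abs(c[i]) for c in rest) for i in range(n)]
    return [
        [c[i] for c in kept] + [q[i] if j == i else 0.0 for j in range(n)]
        for i in range(n)
    ]
```

Because the discarded generators are replaced by their interval hull, the reduced zonotope has the same interval hull as the original one, although the zonotope itself may grow.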

B. Implementation of consistent set step

The consistent state set step requires characterizing the set $X^{y_k}_k$, taking into account the information provided by the measurements $y_k$. This set can be viewed as the inverse image evaluation of $g(x_k, \theta_k, v_k) = Cx_k + v_k$. There are algorithms based on bounding such an image using, for example, subpavings [Kieffer et al., 2002], ellipsoids [Polyak et al., 2004] or zonotopes [Alamo et al., 2005].

To bound the consistent set of states using zonotopes, the intersection of $p$ strips in the state space is calculated. Given a measurement $y_k \in \mathbb{R}^p$, each component of the measurement confines the state to a strip, i.e. the region between two hyperplanes. Define now the sets $X^{y_k}_k(i) = \{x_k : \exists (v_k \in V_k,\ \theta_k \in \Theta) \text{ such that } y_k \in g_i(x_k, \theta_k, v_k)\}$, where $g_i$ denotes the $i$th component of $g \in \mathbb{R}^p$. It is clear that the consistent state set $X^{y_k}_k$ introduced in Definition 3 satisfies:

$$X^{y_k}_k \subseteq \bigcap_{i=1}^{p} X^{y_k}_k(i).$$

C. Implementation of intersection set step

Finally, the intersection set step requires characterizing the set $X_k$. This set is the intersection of the two previously bounded sets: $X_k = X^e_k \cap X^{y_k}_k$. It can be approximated using algorithms based on bounding it by, for example, ellipsoids [Polyak et al., 2004], zonotopes [Alamo et al., 2005] or subpavings [Kieffer et al., 2002]. In this paper, we again use zonotopes to bound this set.

Given the zonotope $X^e_k = p \oplus HB^r$, the strip $X^{y_k}_k = \{x \in \mathbb{R}^n : |c^T x - d| \le \sigma\}$ and a vector $\lambda \in \mathbb{R}^n$, if $X^e_k \cap X^{y_k}_k \ne \emptyset$, we have:

$$X^e_k \cap X^{y_k}_k \subseteq X_k = \hat{p}(\lambda) \oplus \hat{H}(\lambda)B^{r+1} \qquad (6)$$

where:

$$\hat{p}(\lambda) = p + \lambda(d - c^T p) \qquad (7)$$

$$\hat{H}(\lambda) = \begin{bmatrix} (I - \lambda c^T)H & \sigma\lambda \end{bmatrix} \qquad (8)$$

It is possible to choose the parameter vector $\lambda$ in such a way that a size criterion for the obtained bound is minimized. Here, we use the method based on the Frobenius norm proposed in [Alamo et al., 2005].
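The bound on the intersection of a zonotope and a strip can be sketched as below for a given $\lambda$. The function name `strip_intersection` and the list-based storage are hypothetical. The sketch scales the extra generator by the strip half-width $\sigma$, which follows from substituting $c^T x - d = \sigma\omega$ with $|\omega| \le 1$ into $x = p + Hz$; the choice of $\lambda$ itself (for example by a Frobenius-norm criterion) is not shown.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def strip_intersection(
    p: Vector, H: Matrix, c: Vector, d: float, sigma: float, lam: Vector
) -> Tuple[Vector, Matrix]:
    """Zonotope bounding {p + H*B^r} intersected with {x : |c^T x - d| <= sigma}.

    Returns the centre p + lam*(d - c^T p) and the generator matrix
    [(I - lam*c^T) H | sigma*lam], which has one more column than H.
    """
    n, r = len(p), len(H[0])
    cTp = sum(ci * pi for ci, pi in zip(c, p))
    p_new = [pi + li * (d - cTp) for pi, li in zip(p, lam)]
    # Row vector c^T H, reused for every row of the rank-one correction.
    cTH = [sum(c[i] * H[i][j] for i in range(n)) for j in range(r)]
    H_new = [
        [H[i][j] - lam[i] * cTH[j] for j in range(r)] + [sigma * lam[i]]
        for i in range(n)
    ]
    return p_new, H_new
```

As a check, intersecting the unit box (`p = [0, 0]`, `H` the identity) with the strip $|x_1 - 0.5| \le 0.25$ and taking `lam = [1, 0]` yields the centre `[0.5, 0]` and generators spanning exactly $x_1 \in [0.25, 0.75]$, $x_2 \in [-1, 1]$. With `lam` set to zero the original zonotope is returned with a null extra generator.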

3.3 Implementation of worst-case estimators

To implement worst-case estimators using zonotopes, notice that the estimator model of Eq. (2) can be rearranged as a discrete-time system driven by the inputs $u_k$, $w_k$, $y_k$ and $v_k$:

$$\hat{x}_{k+1} = (A - KC)\hat{x}_k + \begin{bmatrix} B & I & K & -K \end{bmatrix} \begin{bmatrix} u_k \\ w_k \\ y_k \\ v_k \end{bmatrix} \qquad (9)$$

Or, equivalently:

$$\hat{x}_{k+1} = A_o\hat{x}_k + B_o u^o_k \qquad (10)$$

where $A_o = A - KC$, $B_o = \begin{bmatrix} B & I & K & -K \end{bmatrix}$ and $u^o_k = \begin{bmatrix} u_k & w_k & y_k & v_k \end{bmatrix}^T$.

Then, the problem of worst-case state observation can be formulated as a problem of worst-case simulation, and requires characterizing the set $X_k$. This set can be viewed as the direct image evaluation of Eq. (9) and can be implemented using zonotopes, as in the prediction step of consistency-based state estimators.
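A simplified sketch of one worst-case observer step on a zonotope is given below. It is an illustration under strong assumptions that are not the paper's full method: the matrices are nominal (no interval parameters, so no zonotope inclusion is needed), and the noise boxes are centred at zero with half-widths `w_rad` and `v_rad`. All function and argument names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain matrix product for lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def matvec(A: Matrix, x: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def observer_step(
    p: Vector, H: Matrix, A: Matrix, B: Matrix, C: Matrix, K: Matrix,
    u: Vector, y: Vector, w_rad: Vector, v_rad: Vector,
) -> Tuple[Vector, Matrix]:
    """Propagate the state zonotope p + H*B^r through the observer.

    Centre: (A - K C) p + B u + K y.
    Generators: [(A - K C) H | diag(w_rad) | K diag(v_rad)]; the sign of the
    measurement-noise block is irrelevant because the box is symmetric.
    """
    n = len(p)
    KC = matmul(K, C)
    Ao = [[A[i][j] - KC[i][j] for j in range(n)] for i in range(n)]
    Aop, Bu, Ky = matvec(Ao, p), matvec(B, u), matvec(K, y)
    p_next = [Aop[i] + Bu[i] + Ky[i] for i in range(n)]
    AoH = matmul(Ao, H)
    W = [[w_rad[i] if j == i else 0.0 for j in range(n)] for i in range(n)]
    KV = [[K[i][j] * v_rad[j] for j in range(len(v_rad))] for i in range(n)]
    H_next = [AoH[i] + W[i] + KV[i] for i in range(n)]
    return p_next, H_next
```

Each step adds one generator per process-noise and measurement-noise component, so in practice an order-reduction step would follow.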

4 Application to fault detection

4.1 Fault detection using consistency-based state estimators

The use of consistency-based state estimation in fault detection is straightforward. The existence of a fault is detected in the intersection step of the consistency-based state estimation algorithm based on set-computations (Algorithm 1). Assuming that the actual system satisfies (1) under non-faulty operating conditions and that the algorithm is correctly initialized with the initial condition given by $X_0$, then, for a given sequence of measured inputs $(u_j)_0^{k-1}$ and outputs $(y_j)_0^{k}$ of the actual system, a fault is said to have occurred if

$$X_k = X^e_k \cap X^{y_k}_k = \emptyset \qquad (11)$$

at any time instant $k$.

When a fault occurs, a recovery strategy is needed. One possibility consists in resetting the set $X^e_k$ to a size which is guaranteed to capture the true state of the system after the fault has been detected.

4.2 Fault detection using worst-case state estimators

In this case, fault detection consists in testing whether the measured output of the system lies within the behavior described by an observer of the faultless system. If the measurements are inconsistent with the predicted output provided by the observer, the existence of a fault is proved. The residual vector usually describes the consistency check between the predicted behaviour, $\hat{y}(k)$, and the real behaviour, $y(k)$:

$$r(k) = y(k) - \hat{y}(k) \qquad (12)$$

Ideally, the residuals should only be affected by the faults. However, the presence of disturbances, noise and modeling errors causes the residuals to become nonzero and thus interferes with the detection of faults. When a dynamic system is modeled using an interval model, the predicted output is described by a set that can be bounded at any iteration by an interval $[\hat{y}(k)]$. Then, the fault detection test is based on propagating the parameter uncertainty to the residual and checking whether

$$0 \in [r(k)] = y(k) - [\hat{y}(k)] \qquad (13)$$

If this condition holds, no fault can be indicated. Otherwise, a fault should be indicated. This test is equivalent to checking whether the measured output belongs to the interval of predicted outputs, i.e., whether

$$y(k) \in [\hat{y}(k)] \qquad (14)$$

When bounded noise is assumed, the measurement can be considered to lie in the interval $[y(k)]$. Then, the previous fault detection test can be restated as

$$[y(k)] \cap [\hat{y}(k)] \ne \emptyset \qquad (15)$$
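The interval tests above amount to simple bound comparisons applied to each output component. A minimal sketch, with hypothetical function names and intervals stored as `(lower, upper)` tuples:

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def output_in_envelope(y: float, y_hat: Interval) -> bool:
    """Point test: the measured output lies inside the predicted interval."""
    return y_hat[0] <= y <= y_hat[1]


def intervals_overlap(y: Interval, y_hat: Interval) -> bool:
    """Bounded-noise test: measurement and prediction intervals intersect."""
    return max(y[0], y_hat[0]) <= min(y[1], y_hat[1])


def fault_indicated(ys: List[Interval], y_hats: List[Interval]) -> bool:
    """A fault is indicated if any output component fails the overlap test."""
    return any(not intervals_overlap(y, yh) for y, yh in zip(ys, y_hats))
```

For instance, a measurement interval `(0.9, 1.1)` overlaps the prediction `(1.0, 2.0)` and raises no alarm, whereas `(2.5, 2.7)` does not overlap it and indicates a fault.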

5 Case study

5.1 Description

A modified version of the benchmark problem proposed in [Chen and Patton, 1996] is considered. It consists of a linearized discrete-time model of a simplified longitudinal flight control system given by:

$$
\begin{aligned}
x_{k+1} &= A_k x_k + B_k u_k + w_k \\
y_k &= C_k x_k + v_k
\end{aligned}
\qquad (16)
$$

with:

$$A_k = \begin{bmatrix} 0.994 \pm \Delta a_{11} & -0.120 \pm \Delta a_{12} & -0.430 \pm \Delta a_{13} \\ 0.002 \pm \Delta a_{21} & 0.990 \pm \Delta a_{22} & -0.074 \pm \Delta a_{23} \\ 0 & 0.8187 & 0 \end{bmatrix}$$

$$B_k = \begin{bmatrix} 0.4252 \pm \Delta b_1 \\ -0.0082 \pm \Delta b_2 \\ 0.1813 \end{bmatrix}, \quad C_k = I_{3 \times 3}$$

The state variables $x = [\eta_y\ \omega_z\ \delta_z]^T$ represent the normal velocity, pitch rate and pitch angle, respectively. The control input is an elevator control signal. The system has been simulated using $u_k = 10$. The covariance matrices for the process and measurement noise sequences are $Q_k = \mathrm{diag}\{0.1^2, 0.1^2, 0.01^2\}$ and $R_k = 0.01^2 I_{3 \times 3}$. The aerodynamic coefficients are randomly perturbed by ±20%, i.e. $\Delta a_{ij} = -0.2a_{ij}$, while the process noise $w_k$ and measurement noise $v_k$ are normally distributed. The initial state vector used in the simulation was $x_o = [0\ 0\ 0]^T$.

To implement the consistency-based and worst-case state estimators, process and measurement noise are only assumed to be bounded (see Eq. (1)). In this example, since the statistical distribution of the noise is given, these bounds are obtained from the covariance matrices by taking 3 times the standard deviation. Then: $w_k = [[-0.3, 0.3], [-0.3, 0.3], [-0.03, 0.03]]$ and $v_k = [[-0.03, 0.03], [-0.03, 0.03], [-0.03, 0.03]]$. On the other hand, parameter uncertainty is obtained by considering that all parameter values in the $A$ matrix are intervals whose center is the nominal value and whose width is $\Delta a_{ij} = -0.2a_{ij}$.
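The noise bounds can be reproduced from the stated covariances with a short calculation. The sketch below assumes diagonal covariance matrices given by their variances; the helper name `three_sigma_bounds` is illustrative.

```python
import math
from typing import List, Tuple


def three_sigma_bounds(variances: List[float]) -> List[Tuple[float, float]]:
    """Symmetric bounds at 3 standard deviations for each noise component."""
    return [(-3.0 * math.sqrt(v), 3.0 * math.sqrt(v)) for v in variances]


# Diagonals of Q = diag{0.1^2, 0.1^2, 0.01^2} and R = 0.01^2 * I
w_bounds = three_sigma_bounds([0.1 ** 2, 0.1 ** 2, 0.01 ** 2])
v_bounds = three_sigma_bounds([0.01 ** 2] * 3)
```

Up to floating-point rounding, `w_bounds` is ±0.3, ±0.3, ±0.03 and `v_bounds` is ±0.03 for each component.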

To make the results obtained using both state estimators comparable, the gain $K$ of the worst-case state estimator is designed taking into account the statistical distribution of the noise. In particular, the observer gain $K$ is determined using the covariance matrices for process and measurement noise, $Q$ and $R$ respectively, and making use of Kalman filter theory [Chen and Patton, 1996] with the steady-state approximation:

$$K = \begin{bmatrix} 0.9899 & -0.1203 & -0.4302 \\ 0 & 0.9901 & -0.0747 \\ 0 & 0.0079 & 0 \end{bmatrix}$$

5.2 Fault detection

Two different types of fault were studied to compare the behavior of the fault detection approaches presented in this paper: an additive fault and a multiplicative fault. In all cases, the zonotope-based fault detection methods described in this paper were initialized using the initial zonotope:

$$X_o = p_o \oplus H_o B^m$$

with:

$$p_o = [0.1\ 0.1\ 0.1]^T, \quad H_o = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

In all cases, the zonotope order $m$ was limited to 27.

A. Additive fault

In this scenario, an additive fault of size 2 is introduced in the pitch angle output measurement, i.e. $y_{k,3} = y_{k,3} + 2$, from time instant $k = 10$. Figure 2 shows the three components of the output measurements and their envelopes obtained with the consistency approach (dashed line) and with the worst-case method (+ marks). The first two components of the output measurement remain inside their envelopes, so no fault is indicated by them. However, the third component (the pitch angle) goes outside the envelopes for several time instants from $k = 10$, so a fault is detected. Figure 3 presents the sets $X^e_k$ and $X^{y_k}_k$ at time $k = 10$ using the consistency test: their intersection is the empty set, $X^e_k \cap X^{y_k}_k = \emptyset$, which is why the fault has been detected using this approach. Figure 4 shows the result of the worst-case fault detection test: 0 means no fault while 1 means fault. Figure 5 presents the result of the consistency fault detection test; in this case the algorithm stops when the fault is detected, i.e. the fault is indicated only for time instant $k = 10$. From these last two figures it can be observed that the persistence of the fault indication is higher with the worst-case test than with the consistency method. In the worst-case approach, the persistence of the fault indication depends on the observer gain $K$.

Figure 2: System output measurements and envelopes (additive fault). Panels: normal velocity, pitch rate and pitch angle versus time instant, k.

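B. Multiplicative fault

In this scenario, a multiplicative fault of size 1 is introduced by modifying the parameter $a_{21}$ of the system matrix $A_k$ from time instant $k = 10$, i.e.:

$$A_k = \begin{bmatrix} a_{11} \pm \Delta a_{11} & a_{12} \pm \Delta a_{12} & a_{13} \pm \Delta a_{13} \\ a_{21} + 1 & a_{22} \pm \Delta a_{22} & a_{23} \pm \Delta a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$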

Figure 6 shows the evolution of the three output measurements and their envelopes obtained with both approaches. In this case all measurements go outside their envelopes at different time instants from $k = 10$, so a fault is indicated by both methods. Figure 7 shows the result of the worst-case fault detection test. Figure 8 presents the result of the consistency fault detection test. Figure 9 presents the sets $X^e_k$ and $X^{y_k}_k$ at time $k = 10$ using the consistency test: their intersection is the empty set, $X^e_k \cap X^{y_k}_k = \emptyset$, which is why the fault has been detected using this approach. From Figures 7 and 8, it can be observed again that the persistence of the fault indication is higher with the worst-case test than with the consistency method, and that it depends on the observer gain $K$.

Figure 3: Fault detected at k=10 using consistency test. Axes: pitch angle, x(1), and normal velocity, x(3).

Figure 4: Worst-case fault detection for an additive fault. Panels: normal velocity residual, pitch rate residual and pitch angle residual versus time instant, k.

Figure 5: Consistency fault detection for an additive fault. Horizontal axis: time instant, k.

Figure 6: System output measurements (multiplicative fault). Panels: normal velocity, pitch rate and pitch angle versus time instant, k.

6 Conclusions

In this paper, two methods for robust fault detection based on state estimation using zonotopes are presented and compared. Both approaches use interval models to describe parameter uncertainty and assume a bounded description of process and measurement noise. The first approach, known as the consistency-based approach, computes a set of uncertain states that are consistent with model uncertainty and process/measurement noise. The second approach, known as the worst-case approach, computes the worst-case estimation of each state variable considering the effect of parameter uncertainty and noise. After applying both approaches to an example based on a simplified longitudinal flight control system, it can be noticed that the worst-case approach offers a fault indication that is more persistent in time than the one provided by the consistency-based approach. This is because, in the consistency-based approach, the fault detection algorithm must be stopped after the fault is detected: an inconsistency has been found and it is impossible to continue with the state estimation. If these two approaches are to be applied to fault isolation, the consistency-based approach would require a memory to keep the fault indication active after the fault is detected. So far, these two approaches to fault detection have always been presented separately in the literature and never compared; this comparison is the main contribution of this paper.

Figure 7: Worst-case fault detection for a multiplicative fault. Panels: normal velocity residual, pitch rate residual and pitch angle residual versus time instant, k.

Figure 8: Consistency fault detection for a multiplicative fault. Horizontal axis: time instant, k.

Figure 9: Fault detected at k=10 using consistency test. Axes: normal velocity, x(1), and pitch rate, x(2).

References

[Alamo et al., 2005] T. Alamo, J.M. Bravo, and E.F. Camacho. Guaranteed state estimation by zonotopes. Automatica, 41(6):1035–1043, 2005.


[Calafiore, 2001] G. Calafiore. A set-valued non-linear filter for robust localization. In European Control Conference (ECC'01), Porto, Portugal, 2001.

[Chen and Patton, 1996] J. Chen and R.J. Patton. Optimal filtering and robust fault diagnosis of stochastic systems with unknown disturbances. Control Theory and Applications, IEE Proceedings, 143(1):31–36, 1996.

[Chen and Patton, 1999] J. Chen and R.J. Patton. Robust Model-Based Fault Diagnosis for Dynamic Systems. Kluwer Academic Publishers, 1999.

[Chisci et al., 1996] L. Chisci, A. Garulli, and G. Zappa. Recursive state bounding by parallelotopes. Automatica, 32:1049–1055, 1996.

[Combastel, 2003] C. Combastel. A state bounding observer based on zonotopes. In European Control Conference (ECC'03), Cambridge, UK, 2003.

[ElGhaoui and Calafiore, 1997] L. ElGhaoui and G. Calafiore. Robust filtering for discrete-time systems with bounded noise and parametric uncertainty. IEEE Transactions on Automatic Control, 46(7):1084–1089, 1997.

[Gertler, 1998] J. Gertler. Fault Detection and Diagnosis in Engineering Systems. Marcel Dekker, 1998.

[Kieffer et al., 2002] M. Kieffer, L. Jaulin, and E. Walter. Guaranteed recursive non-linear state bounding using interval analysis. International Journal of Adaptive Control and Signal Processing, 16(3):193–218, 2002.

[Kuhn, 1998] W. Kuhn. Rigorously computed orbits of dynamical systems without the wrapping effect. Computing, 61(1), 1998.

[Maksarov and Norton, 1996] D. Maksarov and J.P. Norton. State bounding with ellipsoidal set description of the uncertainty. International Journal of Control, 65(5):847–866, 1996.

[Milanese et al., 1996] M. Milanese, J. Norton, H. Piet-Lahanier, and E. Walter, editors. Bounding Approaches to System Identification. Plenum Press, 1996.

[Moore, 1966] R.E. Moore. Interval analysis. Prentice Hall, 1966.

[Polyak et al., 2004] B.T. Polyak, S.A. Sergey, A. Nazin, C. Durieu, and E. Walter. Ellipsoidal parameter or state estimation under model uncertainty. Automatica, 40(7):1171–1179, 2004.

[Puig et al., 2001] V. Puig, P. Cuguero, and J. Quevedo. Worst-case estimation and simulation of uncertain discrete-time systems using zonotopes. In Proceedings of European Control Conference, 2001.

[Puig et al., 2002a] V. Puig, P. Cuguero, and J. Quevedo. Time-invariant approach to set-membership simulation and state observation for discrete-time invariant systems with parametric uncertainty. In Proceedings of the 41st IEEE Conference on Decision and Control, 2002.

[Puig et al., 2002b] V. Puig, J. Quevedo, and T. Escobet. Robust fault detection approaches using interval models. In IFAC World Congress (b'02), Barcelona, Spain, 2002.

[Puig et al., 2005] V. Puig, A. Stancu, and J. Quevedo. Observers for interval systems using set and trajectory-based approaches. In Proceedings of European Control Conference and IEEE Conference on Decision and Control, 2005.

[Rinner and Weiss, 2004] B. Rinner and U. Weiss. Online monitoring by dynamically refining imprecise models. 34:1811–1822, 2004.

[Shamma, 1997] J.S. Shamma. Approximate set-value observer for nonlinear systems. IEEE Transactions on Automatic Control, 42(5):648–658, 1997.

[Witczak et al., 2002] M. Witczak, J. Korbicz, and R.J. Patton. A bounded-error approach to designing unknown input observers. In IFAC World Congress (b'02), Barcelona, Spain, 2002.


!"#"$ %& '#"( !")*$+ ,$-.-%/ ,"0&12 3 2(" $ 4"1+1 56789 .2 :"#;<2(="0>="0?@2">A

Abstract

In this paper, we are interested in the diagnosis of discrete event systems modeled by finite transition systems. We propose a model of supervision patterns general enough to capture past occurrences of particular trajectories of the system. Modeling the diagnosis objective by supervision patterns allows us to generalize the properties to be diagnosed and to render them independent of the description of the system. We first formally define the diagnosis problem in this context. We then derive techniques for the construction of a diagnoser and for the verification of the diagnosability based on standard operations on transition systems. We show that these techniques are general enough to express and solve in a unified way a broad class of diagnosis problems found in the literature, e.g. diagnosing permanent faults, multiple faults, fault sequences and some problems of intermittent faults.

1 Introduction

Diagnosing and monitoring dynamical systems is an increasingly active research domain, and model-based approaches have been proposed which differ according to the kind of models they use [13; 1; 11; 10; 3; 4]. The general diagnosis problem is to detect or identify patterns of particular events on a partially observable system. This paper focuses on discrete-event systems modeled as finite state machines. In this context, patterns usually describe the occurrence of a fault [12; 13], multiple occurrences of a fault [7], or the repair of a system after the occurrence of a fault [2]. The aim of diagnosis is to decide, by means of a diagnoser, whether or not such a pattern occurred in the system. Even if such a decision cannot be taken immediately after the occurrence of the pattern, one requires that it be taken within a bounded delay. This property is usually called diagnosability. It can be checked a priori from the system model, and depends on both its observability and the supervision pattern.

However, the approaches in the literature suffer from some deficiencies. One observes many different definitions of diagnosability and ad hoc algorithms for the construction of the diagnoser, as well as for the verification of diagnosability. As a consequence, all these results are difficult to reuse for new but similar diagnosis problems. We believe that the reason is the absence of a clear definition of the involved patterns, which would clarify the separation between the diagnosis objective and the specification of the system.

In this paper, we formally introduce the notion of supervision pattern as a means to define the diagnosis objectives: a supervision pattern is an automaton whose language is the set of trajectories one wants to diagnose. The proposal is general enough to cover in a unified way an important class of diagnosis objectives, including detection of permanent faults, but also transient faults, multiple faults, repeating faults, as well as quite complex sequences of events.

We then propose a formal definition of the Diagnosis Problem in this context. The essential point is a clear definition of the set of trajectories compatible with an observed trace. The Diagnosis Problem is then expressed as the problem of synthesizing a function over traces, the diagnoser, which decides on the possible/certain occurrence of the pattern on trajectories compatible with the trace. The diagnoser is required to fulfil two fundamental properties: correctness and bounded diagnosability. Correctness expresses that the diagnoser answers accurately, and Bounded Diagnosability guarantees that only a bounded number of observations is needed to eventually answer with certainty that the pattern has occurred. Bounded Diagnosability is formally defined as the $\Omega$-diagnosability of the system (where $\Omega$ is the supervision pattern), which compares to the standard diagnosability of [13]. Relying on the formal framework we have developed, we then propose algorithms for both the synthesis of the diagnoser and the verification of $\Omega$-diagnosability. We believe that these generic algorithms, as well as their correctness proofs, are a lot simpler than the ones proposed in the literature.

The paper is organized as follows. In Section 2, we recall standard definitions and notations on labeled transition systems, as well as the notion of compatible trajectories of an observable trace. Supervision patterns are introduced in Section 3. The diagnosis problem and the $\Omega$-diagnosability are then defined. Section 4 is dedicated to algorithms and their associated proofs, for the construction of a correct diagnoser as well as the verification of $\Omega$-diagnosability. Finally, Section 5 illustrates the approach with an example.

2 Labelled Transition Systems and Related Notions

We first recall useful standard notations. We assume given an alphabet $\Sigma$, that is a finite set. The set of finite sequences over $\Sigma$ is denoted by $\Sigma^*$, with $\varepsilon$ for the empty sequence. In the paper, typical elements of $\Sigma^*$ are $s$, $t$ or $u$, and $s.t$ denotes the concatenation of $s$ and $t$. The length of $s$ is denoted $\|s\|$. We now come to the models of systems.

Definition 1. An LTS over $\Sigma$ is defined by a 4-tuple $M = (Q, \Sigma, \rightarrow, q_0)$ where $Q$ is a finite set of states, $\Sigma$ is the set of events of $M$, $q_0 \in Q$ is a distinguished element called the initial state, and $\rightarrow \subseteq Q \times \Sigma \times Q$ is the partial transition relation.

In the rest of the section, we assume given an LTS $M = (Q, \Sigma, \rightarrow, q_0)$. We write $q \xrightarrow{a} q'$ for $(q, a, q') \in \rightarrow$ and $q \xrightarrow{a}$ for $\exists q' \in Q,\ q \xrightarrow{a} q'$. We extend $\rightarrow$ to sequences by setting: $q \xrightarrow{\varepsilon} q$ always holds, and $q \xrightarrow{s.a} q'$ whenever $q \xrightarrow{s} q''$ and $q'' \xrightarrow{a} q'$, for some $q'' \in Q$. The event set of a state $q \in Q$ is $\Sigma(q) \triangleq \{a \in \Sigma \mid q \xrightarrow{a}\}$. A state $q$ is reachable if $\exists s \in \Sigma^*,\ q_0 \xrightarrow{s} q$. We set $\Delta_M(q, s) \triangleq \{q' \in Q \mid q \xrightarrow{s} q'\}$. In particular $\Delta_M(q, \varepsilon) = \{q\}$. By abuse of notation, for a language $L \subseteq \Sigma^*$, $\Delta_M(q, L) \triangleq \{q' \in Q \mid \exists s \in L,\ q \xrightarrow{s} q'\}$, and for any $P \subseteq Q$, $\Delta_M(P, L) \triangleq \bigcup_{q \in P} \Delta_M(q, L)$. A subset $F \subseteq Q$ is stable whenever $\Delta_M(F, \Sigma) \subseteq F$. $M$ is alive if $\forall q \in Q,\ \Sigma(q) \neq \emptyset$. $M$ is complete if $\forall q \in Q,\ \Sigma(q) = \Sigma$. $M$ is deterministic if for any $q \in Q$ and $a \in \Sigma$, $q \xrightarrow{a} q' \wedge q \xrightarrow{a} q'' \Rightarrow q' = q''$.

The language generated by the system $M$ is $L(M) \triangleq \{s \in \Sigma^* \mid q_0 \xrightarrow{s}\}$, whose elements are called trajectories of $M$. Given a trajectory $s \in L(M)$, we write $L(M)/s \triangleq \{t \in \Sigma^* \mid s.t \in L(M)\}$ for the set of trajectories that extend $s$ in $M$. Later in the paper, we will need to distinguish a subset $F \subseteq Q$ to denote final states. The notions above are extended to this setting by letting $L_F(M) \triangleq \{s \in L(M) \mid \Delta_M(q_0, s) \subseteq F\}$.

A useful operation on LTSs is the synchronous product, which allows one to intersect the languages of two LTSs.

Definition 2. Let $M_i = (Q_i, \Sigma, \rightarrow_i, q_{0,i})$, $i \in \{1, 2\}$, be two LTSs. Their synchronous product is $M_1 \times M_2 \triangleq (Q_1 \times Q_2, \Sigma, \rightarrow, (q_{0,1}, q_{0,2}))$, where $\rightarrow \subseteq (Q_1 \times Q_2) \times \Sigma \times (Q_1 \times Q_2)$ satisfies $(q_1, q_2) \xrightarrow{a} (q_1', q_2')$ whenever $q_1 \xrightarrow{a}_1 q_1'$ and $q_2 \xrightarrow{a}_2 q_2'$.

Clearly $L(M_1 \times M_2) = L(M_1) \cap L(M_2)$, and for $F_1 \subseteq Q_1$ and $F_2 \subseteq Q_2$, we also have $L_{F_1 \times F_2}(M_1 \times M_2) = L_{F_1}(M_1) \cap L_{F_2}(M_2)$. Also, if two sets $F_1 \subseteq Q_1$ and $F_2 \subseteq Q_2$ are stable, $F_1 \times F_2$ is stable in $M_1 \times M_2$.

As we are interested in diagnosing systems (this will be formalized in the next section), partial observation plays a central role. In this regard, the set of events $\Sigma$ is partitioned into $\Sigma_o$ and $\Sigma_{uo}$ ($\Sigma = \Sigma_o \cup \Sigma_{uo}$ with $\Sigma_o \cap \Sigma_{uo} = \emptyset$), where $\Sigma_o$ represents the set of observable events; elements of $\Sigma_{uo}$ are then unobservable events. Typical elements of $\Sigma_o^*$ will be denoted by $\mu$ and $\mu'$. We say that $M$ is $\Sigma_o$-alive if $\forall q \in Q,\ \exists s \in \Sigma^*.\Sigma_o,\ q \xrightarrow{s}$, meaning that there is no terminal loop of unobservable events. Notice that when $M$ has no loop of unobservable events, $M$ is alive if and only if $M$ is $\Sigma_o$-alive.

Let $P_{\Sigma_o} : \Sigma^* \rightarrow \Sigma_o^*$ be the natural projection of trajectories onto $\Sigma_o^*$, defined by $P_{\Sigma_o}(\varepsilon) = \varepsilon$ and $P_{\Sigma_o}(s.a) = P_{\Sigma_o}(s).a$ if $a \in \Sigma_o$, and $P_{\Sigma_o}(s)$ otherwise. The projection simply erases the unobservable events from a trajectory. $P_{\Sigma_o}$ extends to languages by defining, for $L \subseteq \Sigma^*$, $P_{\Sigma_o}(L) \triangleq \{P_{\Sigma_o}(s) \mid s \in L\}$. The inverse projection of $L$ is defined by $P_{\Sigma_o}^{-1}(L) \triangleq \{s \in \Sigma^* \mid P_{\Sigma_o}(s) \in L\}$. Now, the language of traces of $M$ is

$Traces_{\Sigma_o}(M) \triangleq P_{\Sigma_o}(L(M))$

It is the set of observable sequences of its trajectories. From the projection $P_{\Sigma_o}$, we derive an equivalence relation between trajectories of $M$, written $\sim_{\Sigma_o}$, called the Delay-Observation equivalence in reference to the delay-bisimulation of [9]:

Definition 3. Let $\sim_{\Sigma_o} \subseteq L(M) \times L(M)$ be the binary relation defined by $s \sim_{\Sigma_o} s'$ whenever $P_{\Sigma_o}(s) = P_{\Sigma_o}(s')$ and $s \in \Sigma^*.\Sigma_o$ if and only if $s' \in \Sigma^*.\Sigma_o$.

One easily verifies that $\sim_{\Sigma_o}$ is an equivalence relation, and we take the convention to write $[s]$ for the equivalence class of $s$. Given $s \in L(M)$, $s$ naturally maps onto a trace of $M$, namely $P_{\Sigma_o}(s)$. Now, given a non-empty trace $\mu$ of $M$, $\mu$ does not uniquely determine a Delay-Observation equivalence class, as in general $\mu$ can be brought back in $M$ in two different manners: $\mu$ can be associated with the class $[s]$ with $P_{\Sigma_o}(s) = \mu$ and $s \in \Sigma^*.\Sigma_o$, or $\mu$ can be associated with the class $[s']$ with $P_{\Sigma_o}(s') = \mu$ and $s' \in \Sigma^*.\Sigma_{uo}$. Notice that by Definition 3, $[s]$ and $[s']$ are different. Henceforth, we take the convention that the equivalence class denoted by a trace $\mu$ is

$[\![\mu]\!]_{\Sigma_o} \triangleq P_{\Sigma_o}^{-1}(\mu) \cap L(M) \cap \Sigma^*.\Sigma_o$ if $\mu \neq \varepsilon$, and $[\varepsilon]$ otherwise.

We say that $[\![\mu]\!]_{\Sigma_o}$ is the set of trajectories compatible with the trace $\mu$. When clear from the context, we will use $[\![\mu]\!]$ for $[\![\mu]\!]_{\Sigma_o}$. This notion of compatible trajectory will be central for diagnosis, as the aim will be to infer properties on the set of trajectories $[\![\mu]\!]_{\Sigma_o}$ compatible with the observation of the trace $\mu$. The reason for choosing this definition of $[\![\mu]\!]$ is that in the case of online diagnosis, it is natural to assume that the diagnoser is reactive to an observable move of the system.
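The synchronous product and the natural projection of this section can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions that are not in the paper: an LTS is encoded as a dictionary with an `init` state and a `trans` set of `(source, event, target)` triples, and the helper names `sync_product`, `is_trajectory` and `project` are hypothetical.

```python
def sync_product(m1, m2):
    """Synchronous product of two LTSs over a common alphabet.

    Assumed encoding (not from the paper): an LTS is a dict with an
    initial state "init" and a set "trans" of (q, a, q') triples.
    """
    trans = {((p, q), a, (p2, q2))
             for (p, a, p2) in m1["trans"]
             for (q, b, q2) in m2["trans"]
             if a == b}  # both components must move on the same event
    return {"init": (m1["init"], m2["init"]), "trans": trans}


def is_trajectory(m, word):
    """True if `word` belongs to L(m), i.e. some run from init reads it."""
    current = {m["init"]}
    for a in word:
        current = {q2 for (q, b, q2) in m["trans"] if q in current and b == a}
        if not current:
            return False
    return True


def project(word, observable):
    """Natural projection: erase the unobservable events of a trajectory."""
    return tuple(a for a in word if a in observable)


# Two small hypothetical LTSs over the events "a" and "b".
m1 = {"init": 0, "trans": {(0, "a", 0), (0, "b", 1)}}
m2 = {"init": "x", "trans": {("x", "a", "y"), ("y", "b", "x"), ("y", "a", "y")}}
prod = sync_product(m1, m2)
```

On this toy instance, a word is a trajectory of `prod` exactly when it is a trajectory of both `m1` and `m2`, which is the language-intersection property stated above.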

3 Supervision Patterns and the Diagnosis Problem

In this section, we introduce the notion of supervision patterns, which are means to define the languages we are interested in for diagnosis purposes. We then give some examples of such patterns. Finally, we introduce the diagnosis problem for such patterns.

Supervision patterns are represented by particular LTSs:

Definition 4. A supervision pattern is a 5-tuple $\Omega = (Q_\Omega, \Sigma, \rightarrow_\Omega, q_{0\Omega}, F_\Omega)$, where $(Q_\Omega, \Sigma, \rightarrow_\Omega, q_{0\Omega})$ is a deterministic and complete LTS, and $F_\Omega \subseteq Q_\Omega$ is a distinguished stable subset of states.

As $\Omega$ is complete we get $L(\Omega) = \Sigma^*$. Also notice that the assumption that $F_\Omega$ is stable means that its accepted language is extension-closed, i.e. satisfies $L_{F_\Omega}(\Omega).\Sigma^* = L_{F_\Omega}(\Omega)$. In other words, $L_{F_\Omega}(\Omega)$ is a language violating a safety property. This choice is natural since we want to diagnose whether all trajectories compatible with an observed trace have a prefix recognized by the pattern.

We now give examples of supervision patterns which rephrase standard properties of interest for diagnosis.

Let $f \in \Sigma$ be a fault and consider that we are interested in diagnosing the occurrence of this fault. A trajectory $s \in \Sigma^*$ is faulty if $s \in \Sigma^*.f.\Sigma^*$. The supervision pattern $\Omega_f$ of Figure 1 exactly recognizes this language: $L_F(\Omega_f) = \Sigma^*.f.\Sigma^*$.

Figure 1: Supervision pattern for one fault
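As an illustration of the one-fault pattern, the following Python sketch builds a two-state deterministic and complete automaton and checks that its final set is stable. The encoding (a `delta` dictionary, the state names `"N"` and `"F"`, and the function names) is an assumption made for this example, not the paper's notation.

```python
def fault_pattern(alphabet, fault):
    """Two-state pattern for one fault (hypothetical encoding).

    State "N": no fault seen so far; state "F": the fault has occurred.
    The transition function is total, so the automaton is complete.
    """
    delta = {}
    for a in alphabet:
        delta[("N", a)] = "F" if a == fault else "N"
        delta[("F", a)] = "F"  # once in F, every event stays in F
    return {"init": "N", "delta": delta, "final": {"F"}}


def is_stable(pattern, alphabet):
    """The final set is stable if no event leads out of it."""
    return all(pattern["delta"][(q, a)] in pattern["final"]
               for q in pattern["final"] for a in alphabet)


def recognizes(pattern, word):
    """True if the (unique) run on `word` ends in a final state."""
    q = pattern["init"]
    for a in word:
        q = pattern["delta"][(q, a)]
    return q in pattern["final"]


alphabet = {"a", "b", "f"}
omega_f = fault_pattern(alphabet, "f")
```

Because the final set is stable, any extension of a recognized word is also recognized, which is the extension-closure property discussed above.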

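Let $f_1$ and $f_2$ be two faults that may occur in the system. Diagnosing the occurrence of these two faults in a trajectory means deciding the membership of this trajectory in $\Sigma^*.f_1.\Sigma^* \cap \Sigma^*.f_2.\Sigma^* = L_{F_1}(\Omega_{f_1}) \cap L_{F_2}(\Omega_{f_2})$, where the $\Omega_{f_i}$, $i \in \{1, 2\}$, are isomorphic to the supervision pattern $\Omega_f$ described in Figure 1. The supervision pattern is then the product $\Omega_{f_1} \times \Omega_{f_2}$, whose accepted language in $F_1 \times F_2$ is $L_{F_1 \times F_2}(\Omega_{f_1} \times \Omega_{f_2}) = L_{F_1}(\Omega_{f_1}) \cap L_{F_2}(\Omega_{f_2})$.

[Figure: the product supervision pattern $\Omega_{f_1} \times \Omega_{f_2}$]

More generally, the supervision pattern for the occurrence of a set of faults $\{f_1, \ldots, f_l\}$ is the product $\times_{i \in \{1, \ldots, l\}} \Omega_{f_i}$, considering $\times_{i \in \{1, \ldots, l\}} F_i$ as final state set.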

If one is interested in diagnosing different faults in a precise order, for example an occurrence of $f_2$ after an occurrence of $f_1$, the supervision pattern should recognize the trajectories in $L_{F_1}(\Omega_{f_1}).L_{F_2}(\Omega_{f_2}) = \Sigma^*.f_1.\Sigma^*.f_2.\Sigma^*$, as described by the following supervision pattern:

[Figure: supervision pattern for an occurrence of $f_1$ followed by an occurrence of $f_2$]

If $f_1$ corresponds to a fault event and $f_2$ to the repair of this fault in the system, then we actually diagnose the repair of the fault $f_1$. With this pattern, the aim is to match the corresponding notion of diagnosability in [2].

Another interesting problem is to diagnose multiple occurrences of the same fault event $f$, say $k$ times. This corresponds to the supervision pattern given below, whose accepted language is $L_F(\Omega_f)^k$. The aim is to match the $k$-diagnosability of [7].

[Figure: supervision pattern for $k$ occurrences of the fault $f$]

This can be easily generalized to a pattern recognizing the occurrence of $k$ patterns (identical or not).

The following supervision pattern describes the fact that a fault (occurrence of $f$) occurred twice without repair (occurrence of $r$).

[Figure: supervision pattern for two occurrences of $f$ without an occurrence of $r$]

This can be generalized to a pattern recognizing the occurrence of $k$ faults (identical or not) without repair.

In the remainder of the paper, we consider a system whose behavior is modeled by an LTS $G = (Q_G, \Sigma, \rightarrow_G, q_{0G})$. The only assumption made on $G$ is that $G$ is $\Sigma_o$-alive. Notice that $G$ can be non-deterministic. We also consider a supervision pattern $\Omega = (Q_\Omega, \Sigma, \rightarrow_\Omega, q_{0\Omega}, F_\Omega)$ denoting the language $L_{F_\Omega}(\Omega)$ that we want to diagnose.

We define the Diagnosis Problem as the problem of defining a function $Diag_\Omega$ on traces whose intention is to answer the question whether trajectories corresponding to observed traces are recognized or not by the supervision pattern. We require two properties for $Diag_\Omega$: Correctness and Bounded Diagnosability. Correctness means that Yes and No answers should be accurate, while Bounded Diagnosability means that trajectories in $L_{F_\Omega}(\Omega)$ should be diagnosed with finitely many observations.

The Diagnosis Problem can be stated as follows: given an LTS $G$ and a supervision pattern $\Omega$, decide whether there exists (and compute if any) a three-valued function $Diag_\Omega : Traces_{\Sigma_o}(G) \rightarrow \{Yes, No, ?\}$ deciding, for each trace $\mu$ of $G$, on the membership in $L_{F_\Omega}(\Omega)$ of any trajectory in $[\![\mu]\!]$. Formally,

(Diagnosis Correctness) The function should verify

$Diag_\Omega(\mu) = Yes$ if $[\![\mu]\!] \subseteq L_{F_\Omega}(\Omega)$; $Diag_\Omega(\mu) = No$ if $[\![\mu]\!] \cap L_{F_\Omega}(\Omega) = \emptyset$; $Diag_\Omega(\mu) = ?$ otherwise.

(Bounded Diagnosability) As $G$ is only partially observed, we expect in general situations where $Diag_\Omega(\mu) = ?$ (as neither $[\![\mu]\!] \subseteq L_{F_\Omega}(\Omega)$ nor $[\![\mu]\!] \cap L_{F_\Omega}(\Omega) = \emptyset$ holds). However, we require this undetermined situation not to last, in the following sense: there must exist $n \in \mathbb{N}$, the bound, such that whenever $s \in [\![\mu]\!] \cap L_{F_\Omega}(\Omega)$, for all $t \in L(G)/s \cap \Sigma^*.\Sigma_o$, if $\|P_{\Sigma_o}(t)\| \geq n$ then $Diag_\Omega(P_{\Sigma_o}(s.t)) = Yes$.

Diagnosis Correctness means that the diagnosis of a trace $\mu$ is No if no trajectory in its semantics $[\![\mu]\!]$ lies in $L_{F_\Omega}(\Omega)$, while it is Yes if all trajectories in $[\![\mu]\!]$ lie in $L_{F_\Omega}(\Omega)$. Bounded Diagnosability means that when observing a trajectory in $L_{F_\Omega}(\Omega)$, a Yes answer should be produced after finitely many observable events (see Figure 2 for an intuitive explanation of these notions).

Figure 2: The $\Omega$-diagnosability for $\Omega = \Omega_f$

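Now, if $Diag_\Omega$ provides a Correct Diagnosis, Bounded Diagnosability can be rephrased by replacing $Diag_\Omega(P_{\Sigma_o}(s.t)) = Yes$ with $[\![P_{\Sigma_o}(s.t)]\!] \subseteq L_{F_\Omega}(\Omega)$. We obtain what we call the $\Omega$-diagnosability. Notice that this is now a property of $G$ with respect to $\Omega$.

Definition 5. An LTS $G$ is $\Omega(n)$-diagnosable, where $n \in \mathbb{N}$, whenever

$\forall s \in L_{F_\Omega}(\Omega) \cap L(G) \cap \Sigma^*.\Sigma_o,\ \forall t \in L(G)/s \cap \Sigma^*.\Sigma_o$, if $\|P_{\Sigma_o}(t)\| \geq n$ then $[\![P_{\Sigma_o}(s.t)]\!] \subseteq L_{F_\Omega}(\Omega)$. (1)

We say that $G$ is $\Omega$-diagnosable if $G$ is $\Omega(n)$-diagnosable for some $n \in \mathbb{N}$.

$\Omega$-diagnosability says that when a trajectory $s$ ending with an observable event is recognized by the supervision pattern $\Omega$, then for any extension $t$ with enough observable events, any trajectory $s'$ compatible with the observation $P_{\Sigma_o}(s.t)$ is also recognized by $\Omega$.

The remark before Definition 5 is formalized by:

Proposition 1. If $Diag_\Omega$ computes a Correct Diagnosis, then $G$ is $\Omega$-diagnosable if and only if the Bounded Diagnosability Property holds for $Diag_\Omega$.

To show the unifying framework based on supervision patterns, we here consider the very particular supervision pattern $\Omega_f$ of Section 3.2, originally considered by [12; 13] with the associated notion of $f$-diagnosability. Let us first recall this notion. Let $G$ be an LTS which is alive and has no loop of unobservable events. $G$ is $f$-diagnosable whenever

$\exists n \in \mathbb{N},\ \forall s \in L(G) \cap \Sigma^*.f,\ \forall t \in L(G)/s$, if $\|t\| \geq n$ then $\forall u \in L(G),\ P_{\Sigma_o}(u) = P_{\Sigma_o}(s.t) \Rightarrow u \in \Sigma^*.f.\Sigma^*$. (2)

The following proposition relates $f$-diagnosability with $\Omega_f$-diagnosability.

Proposition 2. Let $G$ be an LTS and assume that $G$ is alive and has no loop of internal events. Then $G$ is $f$-diagnosable if and only if $G$ is $\Omega_f$-diagnosable.

We first make the following remarks: $s \in \Sigma^*.f.\Sigma^*$ is equivalent to $s \in L_F(\Omega_f)$; and $s \in \Sigma^*.f$ implies $s \in L_F(\Omega_f)$.

Assume $G$ is $f$-diagnosable, and that $n$ fulfills (2). We prove that $\Omega_f(n)$-diagnosability holds: consider $s \in L_F(\Omega_f) \cap L(G) \cap \Sigma^*.\Sigma_o$ and let $t \in L(G)/s \cap \Sigma^*.\Sigma_o$ with $\|P_{\Sigma_o}(t)\| \geq n$; note that therefore $\|t\| \geq n$. It is easy to show that $s$ decomposes into $s_1.s_2$ where $s_1 \in \Sigma^*.f$, with additionally $s_2.t \in L(G)/s_1$. Now, $\|t\| \geq n$ implies $\|s_2.t\| \geq n$, which, by (2), entails that for any $u \in L(G)$ with $P_{\Sigma_o}(u) = P_{\Sigma_o}(s.t)$, we have $u \in \Sigma^*.f.\Sigma^* = L_F(\Omega_f)$. This implies in particular that $[\![P_{\Sigma_o}(s.t)]\!] \subseteq L_F(\Omega_f)$.

Reciprocally, assume $G$ is $\Omega_f(n)$-diagnosable, for some $n \in \mathbb{N}$. Let $l$ be the length of the longest unobservable trajectory in $G$ (which exists by assumption), and consider $m = (n + 1)(l + 1)$. Consider $s \in L(G) \cap \Sigma^*.f$ and $t \in L(G)/s$ with $\|t\| \geq m$, thus $\|P_{\Sigma_o}(t)\| \geq n + 1$. We have to prove that for all $u \in L(G)$ with $P_{\Sigma_o}(u) = P_{\Sigma_o}(s.t)$, we have $u \in \Sigma^*.f.\Sigma^*$. Let $t = t_1.t_2.t_3$ with $t_1 \in \Sigma_{uo}^*.\Sigma_o$, $t_2 \in \Sigma^*.\Sigma_o$, and $t_3 \in \Sigma_{uo}^*$. Let $s' = s.t_1$. As $F$ is stable and $s \in L_F(\Omega_f)$, we get $s' \in L_F(\Omega_f) \cap \Sigma^*.\Sigma_o$. We have $t_2 \in L(G)/s' \cap \Sigma^*.\Sigma_o$ with $\|P_{\Sigma_o}(t_2)\| \geq n$. By $\Omega_f(n)$-diagnosability, for all $u \in L(G)$ with $P_{\Sigma_o}(u) = P_{\Sigma_o}(s.t)$, we have $u \in \Sigma^*.f.\Sigma^* = L_F(\Omega_f)$.

4 Algorithms for the Diagnosis Problem

We now propose algorithms for the Diagnosis Problem based on standard operations on LTSs. In a first stage, we base the construction of the $Diag_\Omega$ function on the synchronous product of $G$ and $\Omega$ and its determinisation, and prove that the function $Diag_\Omega$ computes a Correct Diagnosis. Next, we propose an algorithm to check the $\Omega$-diagnosability of an LTS, thus ensuring the Bounded Diagnosability Property of the function $Diag_\Omega$, and hence achieving the decision of the Diagnosis Problem.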

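We propose a computation of the function $Diag_\Omega$: given an LTS $G$ and a supervision pattern $\Omega$, we first consider the synchronous product $G_\Omega$ of $G$ and $\Omega$ (see Definition 2). Next we perform on $G_\Omega$ a second operation (see Definition 6) which associates to $G_\Omega$ a deterministic LTS written $Det(G_\Omega)$. We then show how $Det(G_\Omega)$ provides a function $Diag_\Omega$ delivering a Correct Diagnosis.

Let us first introduce a determinisation function.

Definition 6. Let $M = (Q, \Sigma, \rightarrow, q_0)$ be an LTS with $\Sigma = \Sigma_{uo} \cup \Sigma_o$. The determinisation of $M$ is the LTS $Det(M) = (\mathcal{X}, \Sigma_o, \rightarrow_d, X_0)$ where $\mathcal{X} = 2^Q$ is the set of subsets of $Q$ called macro-states, $X_0 = \{q_0\}$ and $\rightarrow_d = \{(X, a, \Delta_M(X, \Sigma_{uo}^*.a)) \mid X \in \mathcal{X} \text{ and } a \in \Sigma_o\}$.

Notice that for this definition the target macro-state $X'$ of a transition $X \xrightarrow{a}_d X'$ is only composed of states $q'$ of $M$ which are targets of sequences of transitions $q \xrightarrow{s.a} q'$ ending with an observable event $a$. The reason for this definition is the coherency with $[\![\mu]\!]$. In fact, from the definition of $\rightarrow_d$ in $Det(M)$, we infer that $\Delta_{Det(M)}(X_0, \mu) = \Delta_M(q_0, [\![\mu]\!])$, which means that the macro-state reached from $X_0$ by $\mu$ in $Det(M)$ is composed of the set of states that are reached from $q_0$ by trajectories of $[\![\mu]\!]$ in $M$. Finally, determinisation preserves traces, so we have $L(Det(M)) = Traces_{\Sigma_o}(Det(M)) = Traces_{\Sigma_o}(M)$.

We now explain the construction of the diagnoser from $G$ and $\Omega$. Let us first consider the synchronous product $G_\Omega = G \times \Omega$ (see Definition 2). We then get $L(G_\Omega) = L(G) \cap L(\Omega) = L(G)$ (as $\Omega$ is complete, thus $L(\Omega) = \Sigma^*$). We also get $L_{Q_G \times F_\Omega}(G_\Omega) = L(G) \cap L_{F_\Omega}(\Omega)$, meaning that the trajectories of $G$ accepted by $\Omega$ are exactly the accepted trajectories of $G_\Omega$. Finally note that $Q_G \times F_\Omega$ is stable in $G_\Omega$, as both $Q_G$ and $F_\Omega$ are stable by assumption.

We now apply determinisation to $G_\Omega$. We have $L(Det(G_\Omega)) = Traces_{\Sigma_o}(G_\Omega) = Traces_{\Sigma_o}(G)$, thus for all $\mu \in Traces_{\Sigma_o}(G)$, $\Delta_{Det(G_\Omega)}(X_0, \mu) = \Delta_{G_\Omega}(q_0, [\![\mu]\!])$.

We now establish the following fundamental results on the construction $Det(G_\Omega)$.

Proposition 3. For any $\mu \in Traces_{\Sigma_o}(G) = Traces_{\Sigma_o}(G_\Omega)$,

$\Delta_{Det(G_\Omega)}(X_0, \mu) \subseteq Q_G \times F_\Omega \iff [\![\mu]\!] \subseteq L_{F_\Omega}(\Omega)$ (3)

$\Delta_{Det(G_\Omega)}(X_0, \mu) \cap (Q_G \times F_\Omega) = \emptyset \iff [\![\mu]\!] \cap L_{F_\Omega}(\Omega) = \emptyset$ (4)

(3) means that all trajectories compatible with a trace $\mu$ are accepted by $\Omega$ if and only if $\mu$ leads to a macro-state only composed of marked states in $G_\Omega$.

(4) means that no trajectory compatible with $\mu$ is accepted by $\Omega$ if and only if $\mu$ leads to a macro-state only composed of unmarked states in $G_\Omega$.

The proof of (3) is established by the following sequence of equivalences:

$\Delta_{Det(G_\Omega)}(X_0, \mu) \subseteq Q_G \times F_\Omega \iff \Delta_{G_\Omega}(q_0, [\![\mu]\!]) \subseteq Q_G \times F_\Omega \iff [\![\mu]\!] \subseteq L_{Q_G \times F_\Omega}(G_\Omega) \iff [\![\mu]\!] \subseteq L(G) \cap L_{F_\Omega}(\Omega) \iff [\![\mu]\!] \subseteq L_{F_\Omega}(\Omega)$, as $[\![\mu]\!] \subseteq L(G)$.

Similarly, for the proof of (4) we have:

$\Delta_{Det(G_\Omega)}(X_0, \mu) \cap (Q_G \times F_\Omega) = \emptyset \iff \Delta_{G_\Omega}(q_0, [\![\mu]\!]) \cap (Q_G \times F_\Omega) = \emptyset \iff [\![\mu]\!] \cap L_{Q_G \times F_\Omega}(G_\Omega) = \emptyset \iff [\![\mu]\!] \cap L(G) \cap L_{F_\Omega}(\Omega) = \emptyset \iff [\![\mu]\!] \cap L_{F_\Omega}(\Omega) = \emptyset$, as $[\![\mu]\!] \subseteq L(G)$.

We now have the material to define the function $Diag_\Omega$ and to obtain the Diagnosis Correctness Property, following directly from Proposition 3.

Theorem 1. Let $Det(G_\Omega)$ be the LTS built as above, and let

$Diag_\Omega(\mu) = Yes$ if $\Delta_{Det(G_\Omega)}(X_0, \mu) \subseteq Q_G \times F_\Omega$; $Diag_\Omega(\mu) = No$ if $\Delta_{Det(G_\Omega)}(X_0, \mu) \cap (Q_G \times F_\Omega) = \emptyset$; $Diag_\Omega(\mu) = ?$ otherwise. (5)

$Diag_\Omega$ computes a Correct Diagnosis.

Example 1. In order to illustrate the diagnoser construction, consider the LTS $G$ of Figure 3 (left-hand side). Assume we want to diagnose the occurrence of the fault event $f$. We thus use the supervision pattern $\Omega_f$ described in Figure 1 and build the product $G_\Omega = G \times \Omega_f$. In this case $G_\Omega$ is isomorphic to $G$, with set of marked states $\{2, 3, 4\}$. The diagnoser, as well as its answers, obtained by determinisation of $G_\Omega$ is also represented in Figure 3 (right-hand side).

Figure 3: $G$ and its associated diagnoser w.r.t. $\Omega_f$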
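The diagnoser of this section can be sketched as follows. The sketch computes the macro-state reached by a trace on the fly rather than building the whole determinised LTS. The encoding (a set of `(q, a, q')` triples for the product, a set `marked` of marked states) and the toy system at the end are assumptions made for illustration; the toy system is not the LTS of Figure 3.

```python
def step_macro_state(trans, macro, a, unobs):
    """Macro-state reached from `macro` by unobservable events followed by `a`."""
    closure, stack = set(macro), list(macro)
    while stack:  # states reachable through unobservable events only
        q = stack.pop()
        for (p, b, p2) in trans:
            if p == q and b in unobs and p2 not in closure:
                closure.add(p2)
                stack.append(p2)
    return frozenset(p2 for (p, b, p2) in trans if p in closure and b == a)


def diagnose(trans, init, marked, unobs, trace):
    """Three-valued verdict for an observed trace of the product system."""
    macro = frozenset({init})
    for a in trace:
        macro = step_macro_state(trans, macro, a, unobs)
        if not macro:
            raise ValueError("not a trace of the system")
    if macro <= marked:
        return "Yes"   # every compatible trajectory is recognized
    if not (macro & marked):
        return "No"    # no compatible trajectory is recognized
    return "?"


# Hypothetical product system: "f" is an unobservable fault,
# states 2 and 4 are marked (the fault has occurred).
trans = {(1, "f", 2), (1, "a", 3), (2, "a", 4), (3, "b", 3), (4, "c", 4)}
marked = frozenset({2, 4})
unobs = {"f"}
```

In this toy system, observing `a` alone is ambiguous, while a following `c` confirms the fault and a following `b` rules it out.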

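As we have established the Correctness of $Diag_\Omega$, according to Proposition 1 the Bounded Diagnosability Property of $Diag_\Omega$ is provided by the $\Omega$-diagnosability of $G$. We now propose an algorithm for deciding $\Omega$-diagnosability (Definition 5).

This algorithm is adapted from [5; 14]. The idea is that $G$ is not $\Omega$-diagnosable if there exists an arbitrarily long trace $\mu$ such that two trajectories compatible with $\mu$ disagree on $L_{F_\Omega}(\Omega)$ membership (see the above example). We first introduce the Delay-Observational-Closure $\varepsilon_{\Sigma_o}(G_\Omega)$, which preserves the information about $L_{F_\Omega}(\Omega)$ membership while abstracting away unobservable events. Next, a self-product $\varepsilon_{\Sigma_o}(G_\Omega) \times \varepsilon_{\Sigma_o}(G_\Omega)$ allows us to extract from a trace $\mu$ pairs of trajectories of $G_\Omega$ and to check their $L_{F_\Omega}(\Omega)$ membership agreement.

Definition 7. For an LTS $M = (Q, \Sigma, \rightarrow, q_0)$, the Delay-Observational-Closure of $M$ is the LTS $\varepsilon_{\Sigma_o}(M) = (Q, \Sigma_o, \rightarrow_o, q_0)$ where $q \xrightarrow{a}_o q'$ whenever $q \xrightarrow{s.a} q'$ in $M$ for some $s \in \Sigma_{uo}^*$ and $a \in \Sigma_o$.

By definition, for all $\mu \in Traces_{\Sigma_o}(M)$, $q_0 \xrightarrow{\mu}_o q$ in $\varepsilon_{\Sigma_o}(M)$ if and only if $\exists s \in [\![\mu]\!]$ s.t. $q_0 \xrightarrow{s} q$ in $M$.

Consider now $\varepsilon_{\Sigma_o}(G_\Omega) = (Q, \Sigma_o, \rightarrow_o, q_0)$ and let $\Gamma \triangleq \varepsilon_{\Sigma_o}(G_\Omega) \times \varepsilon_{\Sigma_o}(G_\Omega)$ be the LTS $(Q \times Q, \Sigma_o, \rightarrow_\Gamma, (q_0, q_0))$. By definition of $\varepsilon_{\Sigma_o}$ and of the synchronous product, if $\mu \in Traces_{\Sigma_o}(G)$ and $(q_0, q_0) \xrightarrow{\mu}_\Gamma (q, q')$, there exist $s, s' \in [\![\mu]\!]$ s.t. $q_0 \xrightarrow{s} q$ and $q_0 \xrightarrow{s'} q'$ in $G_\Omega$.

Definition 8. Given $\Gamma$ defined as above, we say that $(q, q') \in Q \times Q$ is $\Omega$-determined whenever $q \in Q_G \times F_\Omega \iff q' \in Q_G \times F_\Omega$. Otherwise, it is called $\Omega$-undetermined. A path in $\Gamma$ is called an $n$-undetermined path if it contains $n + 1$ consecutive $\Omega$-undetermined states (thus $n$ events between them). A path in $\Gamma$ is an undetermined cycle if it is a cycle whose states are all undetermined.

We now show the relation between $\Omega(n)$-diagnosability and the existence of $n$-undetermined paths.

Lemma 1. $G$ is $\Omega(n)$-diagnosable if and only if there is no reachable $n$-undetermined path in $\Gamma$.

Suppose there is no reachable $n$-undetermined path. Let $s \in L_{F_\Omega}(\Omega) \cap L(G) \cap \Sigma^*.\Sigma_o$ and $t \in L(G)/s \cap \Sigma^*.\Sigma_o$ with $\|P_{\Sigma_o}(t)\| \geq n$. We should prove that for all $u \in [\![P_{\Sigma_o}(s.t)]\!]$, $u \in L_{F_\Omega}(\Omega)$. Let $\mu = P_{\Sigma_o}(s.t)$. Any path $(q_0, q_0) \xrightarrow{\mu}_\Gamma (q, q')$ in $\Gamma$ can be decomposed into a path $(q_0, q_0) \xrightarrow{\mu_1}_\Gamma (q_1, q_1') \xrightarrow{\mu_2}_\Gamma (q, q')$ with $\mu_1 = P_{\Sigma_o}(s)$ and $\mu_2 = P_{\Sigma_o}(t)$. We have $\|\mu_2\| = \|P_{\Sigma_o}(t)\| \geq n$, thus $(q_1, q_1') \xrightarrow{\mu_2}_\Gamma (q, q')$ is a path with at least $n$ events. By hypothesis, one of its states, say $(q_i, q_i')$, is determined, and as $s \in L_{F_\Omega}(\Omega)$, $(q_i, q_i')$ is surely in $(Q_G \times F_\Omega)^2$. Now as $Q_G \times F_\Omega$ is stable, $(q, q') \in (Q_G \times F_\Omega)^2$. From this it is clear that for all $u \in [\![P_{\Sigma_o}(s.t)]\!]$, $u \in L_{F_\Omega}(\Omega)$.

Conversely, suppose now that there is an $n$-undetermined path $(q_1, q_1') \xrightarrow{\mu_2}_\Gamma (q, q')$ in $\Gamma$ with $\|\mu_2\| = n$ (i.e. all states on the path are undetermined). If this path is reachable, there is also a path $(q_0, q_0) \xrightarrow{\mu_1}_\Gamma (q_1, q_1')$. As $(q_1, q_1')$ is undetermined, there exist $s \in [\![\mu_1]\!]$ with $s \in L_{F_\Omega}(\Omega) \cap \Sigma^*.\Sigma_o$, and $s' \in [\![\mu_1]\!]$ with $s' \notin L_{F_\Omega}(\Omega)$. There also exists $t \in L(G)/s \cap \Sigma^*.\Sigma_o$ with $P_{\Sigma_o}(t) = \mu_2$. As all states in the path are undetermined, there exists $t' \in L(G)/s' \cap \Sigma^*.\Sigma_o$ with $P_{\Sigma_o}(t') = \mu_2$, but $s'.t' \notin L_{F_\Omega}(\Omega)$. We thus have $s \in L_{F_\Omega}(\Omega) \cap \Sigma^*.\Sigma_o$ and $t \in L(G)/s \cap \Sigma^*.\Sigma_o$ with $\|P_{\Sigma_o}(t)\| \geq n$, and $s'.t' \in [\![P_{\Sigma_o}(s.t)]\!]$ with $s'.t' \notin L_{F_\Omega}(\Omega)$. This proves that $G$ is not $\Omega(n)$-diagnosable.

Theorem 2. $G$ is $\Omega$-diagnosable if and only if there exists $n \in \mathbb{N}$ such that $\Gamma$ contains no reachable $n$-undetermined path.

Based on Theorem 2 and on the fact that $\Gamma$ is finite state, we conclude that:

Corollary 1. $G$ is not $\Omega$-diagnosable if and only if $\Gamma$ contains a reachable undetermined cycle.

Using Proposition 1 and the construction of $\Gamma$, verifying diagnosability amounts to checking the existence of reachable undetermined cycles in $\Gamma$, retrieving the idea of the algorithm of [14; 5]. By Corollary 1 and Lemma 1:

Corollary 2. If $G$ is $\Omega$-diagnosable and $n$ is the length of the longest undetermined path of $\Gamma$, then $G$ is $\Omega(n + 1)$-diagnosable and not $\Omega(n)$-diagnosable.

We now summarize the procedure to determine whether $G$ is $\Omega$-diagnosable: we perform a depth-first search on $\Gamma$ which either exhibits an undetermined cycle or ends by having computed the length of the longest undetermined sequence. Obviously, this has linear cost in the size of $\Gamma$.

Example 2. In order to illustrate the construction of $\Gamma$, let us come back to Example 1. $\varepsilon_{\Sigma_o}(G_{\Omega_f})$ is given in Figure 4 (left-hand side). The rectangles correspond to the marked states. Now $\Gamma = \varepsilon_{\Sigma_o}(G_\Omega) \times \varepsilon_{\Sigma_o}(G_\Omega)$ is given in Figure 4 (right-hand side).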
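The verification procedure can be sketched in Python as below. This is a sketch under stated assumptions: the product system is encoded as a set of `(q, a, q')` triples with a set `marked` of marked states, all function names are hypothetical, and the cycle test repeatedly prunes undetermined pairs that have no undetermined successor, which is a simple substitute for the depth-first search mentioned in the text rather than the paper's own procedure.

```python
def observational_closure(trans, unobs):
    """Observable transitions q -a-> q' such that q reaches q' by
    unobservable events followed by the observable event a."""
    states = {p for (p, _, _) in trans} | {q for (_, _, q) in trans}
    closed = set()
    for q in states:
        reach, stack = {q}, [q]
        while stack:  # unobservable reachability from q
            x = stack.pop()
            for (p, b, p2) in trans:
                if p == x and b in unobs and p2 not in reach:
                    reach.add(p2)
                    stack.append(p2)
        closed |= {(q, b, p2) for (p, b, p2) in trans
                   if p in reach and b not in unobs}
    return closed


def is_diagnosable(trans, init, marked, unobs):
    """True iff the self-product has no reachable undetermined cycle."""
    succ = {}
    for (p, a, p2) in observational_closure(trans, unobs):
        succ.setdefault(p, []).append((a, p2))
    start = (init, init)
    seen, stack, edges = {start}, [start], {}
    while stack:  # explore the reachable part of the self-product
        (q, r) = stack.pop()
        for (a, q2) in succ.get(q, []):
            for (b, r2) in succ.get(r, []):
                if a == b:
                    edges.setdefault((q, r), set()).add((q2, r2))
                    if (q2, r2) not in seen:
                        seen.add((q2, r2))
                        stack.append((q2, r2))
    # undetermined pairs: exactly one component is marked
    alive = {(q, r) for (q, r) in seen if (q in marked) != (r in marked)}
    changed = True
    while changed:  # prune pairs with no undetermined successor left
        changed = False
        for node in list(alive):
            if not (edges.get(node, set()) & alive):
                alive.discard(node)
                changed = True
    return not alive  # a non-empty remainder contains a cycle


# Hypothetical systems: "f" is an unobservable fault, states 2 and 4 marked.
good = {(1, "f", 2), (1, "a", 3), (2, "a", 4), (3, "b", 3), (4, "c", 4)}
bad = {(1, "f", 2), (1, "a", 3), (2, "a", 4), (3, "b", 3), (4, "b", 4)}
```

In `good`, the faulty and fault-free branches emit different events after `a`, so no undetermined pair can be repeated; in `bad`, both branches loop on `b`, which yields an undetermined cycle.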

Figure 4: $\varepsilon_{\Sigma_o}(G_\Omega)$ and $\Gamma$ for the LTS of Figure 3

The tuples $(1, 3)$, $(3, 1)$, $(1, 4)$, $(4, 1)$ in $\Gamma$ are undetermined. Now it is easy to show that there is no undetermined cycle, which according to Corollary 1 ensures that $G$ is $\Omega_f$-diagnosable. Indeed, as soon as $f$ is triggered, a distinguishing observable event is observed after the occurrence of a finite, bounded number of observable events. Thus observing this event surely indicates that $f$ occurred in the past.

A contrario, consider the LTS $G'$ in Figure 5. The corresponding self-product $\Gamma'$, given in Figure 6, has undetermined cycles in $(1, 3)$ and $(3, 1)$. Thus $G'$ is not $\Omega_f$-diagnosable. In fact, for any $n$, a fault-free trajectory $s_1$ and a faulty trajectory $s_2$ beginning with $f$ are both compatible with the same trace $\mu$ of length $n$, while $s_1 \notin L_F(\Omega_f)$, whereas $s_2 \in L_F(\Omega_f)$.

Figure 5: $G'$ and its associated diagnoser w.r.t. $\Omega_f$

Incidentally, this example proves that $\Omega$-diagnosability cannot be checked directly on the diagnoser. In fact, the diagnosers for $G'$ and $\Omega_f$ (Figure 5, right) and for $G$ and $\Omega_f$ (Figure 3, right) are isomorphic, and $G$ is $\Omega_f$-diagnosable while $G'$ is not.

Figure 6: $\Gamma'$ for the LTS $G'$ of Example 1

5 Supervision Example

The example we discuss here, given in Figure 7, illustrates the approach presented above. In this example, we simply model the movement of a person in a building composed of an office (I), a library (B), a reception (A) and a coffee-shop (C). The doors from one part of the building to another can be taken in only one direction. The transitions model the crossings of the doors. Some doors are secured by access-cards (possibly allowing the observation). We assume that there exist access-cards for three of the doors, meaning that when activated, it is possible to observe the fact that one person crosses the door. We consider the supervision pattern $\Omega$ (Figure 7), which expresses the fact that going twice to the coffee-shop without going to the library is a behaviour that has to be supervised.

Figure 7: $G$ and its supervision pattern $\Omega$

Following the different steps described in the previous sections, the product $G_\Omega = G \times \Omega$ is used to label the states of $G$ with respect to the supervision pattern $\Omega$. The corresponding LTS is described in Figure 8.

Figure 8: $G_\Omega$

Let us first assume that only the access-cards of the first two of these doors are activated, so that only the two corresponding events are observable (i.e. $\Sigma_o$ contains exactly these two events). Notice that the system then has internal events loops. The observable system (i.e. $\varepsilon_{\Sigma_o}(G_\Omega)$) is given in Figure 9.

Figure 9: $\varepsilon_{\Sigma_o}(G_\Omega)$ when two access-cards are activated

It is easy to check that the LTS $\Gamma = \varepsilon_{\Sigma_o}(G_\Omega) \times \varepsilon_{\Sigma_o}(G_\Omega)$ (not represented here) has a reachable undetermined cycle, thus $G$ is not $\Omega$-diagnosable with this set of observable events. However, if the access-card of the third door is also activated (i.e. $\Sigma_o$ contains the three corresponding events), the observable system $\varepsilon_{\Sigma_o}(G_\Omega)$ is given by the LTS represented in Figure 10.

Figure 10: $\varepsilon_{\Sigma_o}(G_\Omega)$ when three access-cards are activated

One can check that $\varepsilon_{\Sigma_o}(G_\Omega)$ is deterministic. Thus $\Gamma = \varepsilon_{\Sigma_o}(G_\Omega) \times \varepsilon_{\Sigma_o}(G_\Omega)$ is isomorphic to $\varepsilon_{\Sigma_o}(G_\Omega)$, and thus has no undetermined cycle. Consequently, $G$ is $\Omega$-diagnosable for this set of observable events. Note that we also have $Det(G_\Omega) = \varepsilon_{\Sigma_o}(G_\Omega)$. Thus $\varepsilon_{\Sigma_o}(G_\Omega)$ actually corresponds to the diagnoser.

6 Conclusion

The present paper advocates the use of supervision patterns for the description of diagnosis objectives. A supervision pattern is an automaton, like the ones used in many different domains (verification, model-based testing, pattern matching, etc.), in order to unambiguously denote a formal language.

As illustrated in the paper, fault-occurrence diagnosis is a particular case of pattern diagnosis, but patterns are also useful to describe more general objectives, as shown in Subsection 3.1 and Section 5. The concept of supervision patterns is even more attractive in the sense that patterns can be composed using the usual combinators inherited from language theory (union, intersection, concatenation, etc.).

We are interested in diagnosing the occurrence of trajectories violating a safety property, which by definition can be violated on a finite prefix. It is then natural to assume that patterns recognize extension-closed languages, in the sense that if a trajectory of the system belongs to the language, so does any extension of this trajectory. This is technically achieved by the stability assumption on the automaton $\Omega$. Adopting the behavioral-properties point of view on the patterns leads to the attempt to diagnose any linear time pure past temporal formula [8]. It is clear that the properties we consider do not meet the LTL definable properties handled by [6].

Out of concern for exposing a fairly general framework for diagnosis issues, the Diagnosis Problem is presented in a rather denotational spirit, as opposed to the operational spirit we find in the literature: we put the emphasis on the diagnosis function $Diag_\Omega$ with its correctness and bounded diagnosability properties. Correctness is an essential property that ensures the accuracy of the diagnosis. Moreover, verifying the diagnosability property of the system with respect to the supervision pattern guarantees that when using $Diag_\Omega$ online, an occurrence of the pattern will eventually be diagnosed, and that this eventuality can be quantified. It is the standard notion of Diagnosability, but seen here as a mere means to achieve a satisfactory diagnosis function; we are aware that this point of view differs from other classical approaches. The definition of diagnosability as proposed here is automata-based, with $G$ and $\Omega$, but could as well be expressed in a language-based framework.

We now turn to technical aspects of the approach. We have insisted on what the semantics of a trace is: a trace denotes the set of trajectories which project onto this trace and which necessarily end with an observable event. Consequences of this choice are manifold in the definitions of $\Omega$-diagnosability, $Det(G_\Omega)$ and $\varepsilon_{\Sigma_o}(G_\Omega)$. We could have chosen another semantics, impacting the related definitions accordingly: for example, we could have considered the set of all trajectories which project onto this trace. What matters most is the accurate match between the semantics of traces and the other definitions: hence we avoid displeasing discrepancies when determining precisely the Diagnosability Bound, and even better, we have a clear proof of the correctness of the synthesis algorithm. However, we believe our choice is the most natural when admitting that the diagnosis function, implemented online as an output verdict, is reactive to an observable move of the system.

A more sophisticated diagnosis than the one explained here can be derived from our construction (this is fairly standard): for example, we can take advantage of knowing that the Diagnosability bound is exactly $n$. Assume that, after a trace $\mu$, the function $Diag_\Omega$ produces ? on $n$ consecutive events; then necessarily the trajectories compatible with $\mu$ cannot have met the pattern.

We terminate the discussion with future work perspectives aiming at two independent objectives. The first objective is to extend the algorithms to more expressive classes of systems, such as infinite systems where data information is exploited; this would significantly enlarge the applicability of the methods. The second objective is to relax the stability assumption, or equivalently to turn to languages which are not extension-closed, intending to encompass frameworks like [2] for intermittent faults, for which a possible supervision pattern would be given by the following LTS.

[Figure: a possible supervision pattern for intermittent faults]

.A#2 ; ! "!! #"$ %&% ''% : ( ) * + *,- .,-/0 , ! ¡%¥1&%£% %££2 3 ) ( ) * + *,- .,-/ 0 , ! £¡4%¥122&¤ %£££ 5 ; ( 6 3 " ! * #*$ %£££7 6 8 9 ( 3 : " 0 - ! ¤¡¥12&2% %££¤ 6 3 : 5 ; < " 0 - !'¡¤¥1'2&'7 %££ 6 3 : 8 4 " 0 => -'¡%¥12£&2%2 %££2 5 ; 0 ! - .¡%¥12£2&2% ''7' 3 ? !!@- 0 !> -- #$ % AB. %7&2 9 '£ C D : ( ; ; ! "!! E ! ¤¡%¥ %££7 3) D : ( 1 F ; G " ! H I * + .,- 2£&2 ''% 3 : ) " 0 - ! £¡'¥1777&77 ''72 3 : ) 5 " 0 ! .,- 0!, ¡%¥1£7&% ''¤ C < " 0 J - ! ¡'¥ %££%


On-line diagnosis for Time Petri Nets

G. Jiroveanu†,*, B. De Schutter‡, R.K. Boel†
† EESA - SYSTeMS, University of Ghent, Belgium, {george.jiroveanu,[email protected]
‡ DCSC, Delft University of Technology, The [email protected]

Abstract

We derive in this paper on-line algorithms for fault diagnosis of Time Petri Net (TPN) models. The plant observation is given by a subset of transitions while the faults are represented by unobservable transitions. The model-based diagnosis uses the TPN model to derive the legal traces that obey the received observation and then checks whether or not fault events occurred. To avoid the consideration of all the interleavings of the concurrent transitions, the plant analysis is based on partial orders (unfoldings). The legal plant behavior is obtained as a set of configurations. The set of legal traces in the TPN is obtained by solving a system of (max,+)-linear inequalities called the characteristic system of a configuration. We present two methods to derive the entire set of solutions of a characteristic system, one based on the Extended Linear Complementarity Problem and the second one based on constraint propagation that exploits the partial order relation between the events in the configuration.

1 Introduction

This paper deals with the diagnosis of TPNs. TPNs are extensions of untimed Petri Nets (PNs) where information about the execution delay of some operations is available in the model. In a TPN a transition can be fired within a given time interval after its enabling and its execution takes no time to complete. A trace in the plant comprises the transitions that are executed in the TPN model (the untimed support) as well as the time of their occurrence.

Since a transition can be executed at any time within an interval after it has become enabled, the state space of TPNs is in general infinite. Methods based on grouping states under a certain equivalence relation into so-called state classes were proposed in [2]. The state class graph was proved to be finite iff the net is bounded, thus infinite state spaces can be finitely represented and the analysis of TPN models is computable.

* Supported by a European Union Marie Curie Fellowship during his stay at Delft University of Technology (HPMT-CT-2001-0028). Currently with TRANSELECTRICA SA, Craiova, Romania. e-mail: [email protected]

We consider the plant observation given by a subset of transitions whose occurrence is always reported. Moreover the time when an observed transition is executed is measured and reported according to a global clock. The unobservable events are silent, i.e. the execution of an unobservable transition is not acknowledged to the monitoring system. The faults are modeled by a subset of unobservable transitions.

The model-based diagnosis for TPNs comprises two stages. First the set of traces that are legal and that obey the received observation is derived, and then the diagnosis result of the plant is obtained by checking whether some or all of the legal traces include fault transitions.

The diagnosis of a TPN can be derived based on the computation of the state class graph as proposed in [5]. However the analysis of TPNs is not tractable even for models of reasonable size because of the interleaving of (unobservable) concurrent transitions.

Partial orders were shown to be an efficient method to cope with the state space explosion of untimed PNs because the interleaving of concurrent transitions is filtered out [4], [8]. They were also found applicable for the analysis of PN models where the time is considered as a quantifiable and continuous parameter [1], [3].

In this paper we extend the results presented in [6], [7], presenting on-line algorithms for the diagnosis of TPNs based on partial orders. The plant analysis is based on time configurations (time-processes in [1]). A time configuration is an untimed configuration (a configuration in the net-unfolding of the untimed PN support of the TPN model) with a valuation of the execution times for its events. A time configuration is legal if there is a time trace in the original TPN that can be obtained from a linearization of the events of the configuration where the occurrence times of the transitions in the trace are identical with the valuation of their images in the time configuration. A linearization of the events in a configuration is a trace that comprises all the events of the configuration, each executed only once, such that the partial order between the events in the configuration is preserved in the order in which they appear in the trace.

The on-line diagnosis algorithm that we propose works as follows. When the process starts we derive time traces in the TPN model up to the first discarding time. A discarding time is the time when, in absence of any observation, one can discard untimed support traces, and it corresponds with the smallest value of the latest time when an observable event could be forced to happen. The occurrence of an observable transition before the first discarding time is taken into account by eliminating traces that are not consistent with the received observation. Then the plant behaviour is derived up to the next discarding time.

The set of all legal time traces in the original TPN can be obtained by computing for each configuration the entire solution set of a system of (max,+)-linear inequalities called the characteristic system of the configuration.

The calculations involve time interval configurations. A time interval configuration is an untimed configuration endowed with time intervals for the execution of the events within the configuration. A time interval configuration is legal if for every event and for every execution time of the event within its execution time interval there exists a legal time configuration that considers the event executed at that time.

Thus, we need to derive for each configuration the entire solution set of its characteristic system. The naive approach of enumerating all the possible max-elements would imply interleaving concurrent events, which is exactly what we wanted to avoid by using partial orders to represent the plant behaviour. To cope with this difficulty we present two methods that avoid the explicit consideration of all the cases for each max-term in the characteristic system.

The first method uses the Extended Linear Complementarity Problem (ELCP) [10] for deriving the set of all solutions of the characteristic system of the configuration. The solution set can be represented as a union of faces of a polyhedron that satisfy a cross-complementarity condition.

The second method is based on constraint propagation and exploits the partial order relation between the events within the configuration. We derive for each untimed configuration a set of hyperboxes of dimension equal to the number of events within the configuration such that the union of all the subsets of solutions that are circumscribed by the hyperboxes is a cover of the solution set.

The paper is organized as follows. In Section 2 we provide definitions and the notation used in the paper. In Section 3 we formalize the diagnosis problem for TPN models. The analysis of TPNs based on partial orders is described in Section 4. Section 5 and Section 6 present the two methods to derive the solution set of a characteristic system of a configuration, and then in Section 7 we present the on-line diagnosis algorithm that we propose. The paper is concluded in Section 8 with final remarks and future work.

2 Notation and definitions

2.1 Petri nets

A Petri Net is a structure $\mathcal{N} = (P, T, F)$ where $P$ denotes the set of $|P|$ places, $T$ denotes the set of $|T|$ transitions, and $F = Pre \cup Post$ is the incidence function where $Pre(p,t) : P \times T \to \{0,1\}$ and $Post(t,p) : T \times P \to \{0,1\}$ are the pre- and post-incidence functions that specify the arcs.

We use the standard notations: ${}^{\bullet}p$, $p^{\bullet}$ for the set of input, respectively output transitions of a place; similarly ${}^{\bullet}t$ and $t^{\bullet}$ denote the set of input places to $t$, and the set of output places of $t$ respectively. A marking $M$ of a PN is represented by a $|P|$-vector, $M : P \to \mathbb{N}$, that assigns to each place of $\mathcal{N}$ a non-negative number of tokens.

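The set $\mathcal{L}_{\mathcal{N}}(M_0)$ of all legal traces of a PN $\langle \mathcal{N}, M_0 \rangle$ with initial marking $M_0$ is defined as follows. A transition $t$ is enabled at the marking $M$ if $M \geq Pre(\cdot, t)$. Firing an enabled transition $t$ consumes $Pre(p,t)$ tokens in the input places $p \in {}^{\bullet}t$ and produces $Post(t,p)$ tokens in the output places $p \in t^{\bullet}$. The next marking is $M' = M + Post(t,\cdot) - Pre(\cdot,t)$. A trace $\tau$ is defined as $\tau = M_0 \xrightarrow{t_1} M_1 \xrightarrow{t_2} \ldots \xrightarrow{t_k} M_k$, where for $i = 1, \ldots, k$, $M_{i-1} \geq Pre(\cdot, t_i)$. $M_0 \xrightarrow{\tau} M_k$ denotes that the sequence $\tau$ may fire at $M_0$ yielding $M_k$.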
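The enabling and firing rule can be illustrated with a minimal sketch. It assumes arcs of weight one (as in the definition of Pre and Post above) and represents a marking as a dictionary from place names to token counts; the net, the place and transition names, and the helper names `is_enabled` and `fire` are hypothetical and only serve the illustration.

```python
from typing import Dict, Set

Marking = Dict[str, int]


def is_enabled(marking: Marking, preset: Set[str]) -> bool:
    """t is enabled at M if M >= Pre(., t); with 0/1 arc weights this
    means every input place of t holds at least one token."""
    return all(marking.get(p, 0) >= 1 for p in preset)


def fire(marking: Marking, preset: Set[str], postset: Set[str]) -> Marking:
    """Return M' = M + Post(t, .) - Pre(., t) for an enabled transition t."""
    if not is_enabled(marking, preset):
        raise ValueError("transition is not enabled at this marking")
    nxt = dict(marking)
    for p in preset:
        nxt[p] = nxt.get(p, 0) - 1
    for p in postset:
        nxt[p] = nxt.get(p, 0) + 1
    return nxt


# Hypothetical net: p1 -> t1 -> p2 -> t2 -> p3
PRE = {"t1": {"p1"}, "t2": {"p2"}}
POST = {"t1": {"p2"}, "t2": {"p3"}}

m0 = {"p1": 1, "p2": 0, "p3": 0}
m1 = fire(m0, PRE["t1"], POST["t1"])
m2 = fire(m1, PRE["t2"], POST["t2"])
```

Firing `t1` and then `t2` from `m0` corresponds to the trace $M_0 \xrightarrow{t_1} M_1 \xrightarrow{t_2} M_2$.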

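A PN $\langle \mathcal{N}, M_0 \rangle$ is 1-safe if for every place $p \in P$ we have that $M(p) \leq 1$ for any marking $M$ that is reachable from $M_0$.

2.2 Occurrence nets

Definition 1 Given a PN $\mathcal{N} = (P, T, F)$ the immediate dependence relation $\prec_1 \subseteq (P \times T) \cup (T \times P)$ is defined as:
$\forall (a,b) \in (P \times T) \cup (T \times P) : a \prec_1 b$ if $F(a,b) \neq 0$.
Define $\prec$ as the transitive closure of $\prec_1$ ($\prec = \prec_1^{*}$).

The immediate conflict relation $\sharp_1 \subseteq T \times T$ is defined as:
$\forall (t_1,t_2) \in T \times T : t_1 \sharp_1 t_2$ if ${}^{\bullet}t_1 \cap {}^{\bullet}t_2 \neq \emptyset$.
Define $\sharp \subseteq (P \cup T) \times (P \cup T)$ as $\forall (a,b) \in (P \cup T) \times (P \cup T)$: $a \sharp b$ if $\exists t_1, t_2$ s.t. $t_1 \sharp_1 t_2$ and $t_1 \prec a$ and $t_2 \prec b$.

The independence relation $\parallel \subseteq (P \cup T) \times (P \cup T)$ is defined as $\forall (a,b) \in (P \cup T) \times (P \cup T)$:
$a \parallel b \Rightarrow \neg(a \sharp b) \wedge (a \not\prec b) \wedge (b \not\prec a)$

Definition 2 Given two PNs $\mathcal{N} = (P, T, F)$ and $\mathcal{N}' = (P', T', F')$, $\phi$ is a homomorphism from $\mathcal{N}$ to $\mathcal{N}'$, denoted $\phi : \mathcal{N} \to \mathcal{N}'$, where i) $\phi(P) \subseteq P'$ and $\phi(T) \subseteq T'$ and ii) $\forall t \in T$, the restriction of $\phi$ to ${}^{\bullet}t$, respectively $t^{\bullet}$, is a bijection between ${}^{\bullet}t$ and ${}^{\bullet}\phi(t)$, respectively between $t^{\bullet}$ and $\phi(t)^{\bullet}$.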

Definition 3 An occurrence net is a net $\mathcal{O} = (B, E, \prec_1)$ such that:
i) $\forall a \in B \cup E : \neg(a \prec a)$ (acyclic);
ii) $\forall a \in B \cup E : |\{b : a \prec b\}| < \infty$ (well-formed);
iii) $\forall b \in B : |{}^{\bullet}b| \leq 1$ (no backward conflict).

In the following $B$ is referred to as the set of conditions while $E$ is the set of events.

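Definition 4 A configuration $C = (B_C, E_C, \prec)$ in the occurrence net $\mathcal{O}$ is defined as follows:

i) $C$ is a proper sub-net of $\mathcal{O}$ ($C \subseteq \mathcal{O}$)

ii) $C$ is conflict free, i.e. $\forall (a,b) \in (B_C \cup E_C) \times (B_C \cup E_C) \Rightarrow \neg(a \sharp b)$

iii) $C$ is causally upward-closed, i.e. $\forall b \in B_C \cup E_C : a \in B \cup E$ and $a \prec_1 b \Rightarrow a \in B_C \cup E_C$

iv) $\min(C) = \min(\mathcal{O})$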

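Definition 5 Consider a PN $\langle \mathcal{N}, M_0 \rangle$ s.t. $\forall p \in P : M_0(p) \in \{0,1\}$. A branching process $\mathcal{B}$ of a PN $\langle \mathcal{N}, M_0 \rangle$ is a pair $\mathcal{B} = (\mathcal{O}, \phi)$ where $\mathcal{O}$ is an occurrence net and $\phi$ is a homomorphism $\phi : \mathcal{O} \to \mathcal{N}$ s.t.:

1. the restriction of $\phi$ to $\min(\mathcal{O})$ is a bijection between $\min(\mathcal{O})$ and $M_0$ (the set of initially marked places)

2. $\phi(B) \subseteq P$ and $\phi(E) \subseteq T$

3. $\forall a, b \in E : ({}^{\bullet}a = {}^{\bullet}b) \wedge (\phi(a) = \phi(b)) \Rightarrow a = b$

For a configuration $C$ in $\mathcal{O}$ denote by $CUT(C)$ the maximal (w.r.t. set inclusion) set of conditions in $C$ that have no successors in $C$:
$CUT(C) = \left( \left( \bigcup_{e \in E_C} e^{\bullet} \right) \cup \min(\mathcal{O}) \right) \setminus \left( \bigcup_{e \in E_C} {}^{\bullet}e \right)$

Definition 6 Given a PN $\langle \mathcal{N}, M_0 \rangle$ and two branching processes $\mathcal{B}, \mathcal{B}'$ of PN $\langle \mathcal{N}, M_0 \rangle$, then $\mathcal{B}' \sqsubseteq \mathcal{B}$ if there exists an injective homomorphism $\varphi : \mathcal{B}' \to \mathcal{B}$ s.t. $\varphi(\min(\mathcal{B}')) = \min(\mathcal{B})$ and $\phi \circ \varphi = \phi'$.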
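The formula for $CUT(C)$ given before Definition 6 can be computed directly from the presets and postsets of the events of a configuration. The following is a minimal sketch; the function name `cut`, the dictionary-based encoding and the small occurrence net are hypothetical.

```python
from typing import Dict, Iterable, Set


def cut(min_conditions: Set[str],
        events: Iterable[str],
        preset: Dict[str, Set[str]],
        postset: Dict[str, Set[str]]) -> Set[str]:
    """CUT(C): conditions produced by the events of C (or initially
    present) that are not consumed by any event of C."""
    produced = set(min_conditions)
    consumed: Set[str] = set()
    for e in events:
        produced |= postset[e]
        consumed |= preset[e]
    return produced - consumed


# Hypothetical occurrence net: e1 consumes b1 and produces b3,
# e2 consumes b2 and b3 and produces b4.
MIN_O = {"b1", "b2"}
PRESET = {"e1": {"b1"}, "e2": {"b2", "b3"}}
POSTSET = {"e1": {"b3"}, "e2": {"b4"}}

cut_empty = cut(MIN_O, [], PRESET, POSTSET)
cut_e1 = cut(MIN_O, ["e1"], PRESET, POSTSET)
cut_e1_e2 = cut(MIN_O, ["e1", "e2"], PRESET, POSTSET)
```

For the empty configuration the cut is $\min(\mathcal{O})$ itself, which matches the role of the initial marking.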

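There exists a unique (up to an isomorphism) maximum branching process (w.r.t. $\sqsubseteq$) that is the unfolding of $\langle \mathcal{N}, M_0 \rangle$ and is denoted $\mathcal{U}_{\mathcal{N}}(M_0)$ [8].

Denote by $\mathcal{C}$ the set of all the configurations $C$ of the occurrence net $\mathcal{U}_{\mathcal{N}}(M_0)$. For a configuration $C \in \mathcal{C}$ denote by $\langle E_C \rangle$ the set of strings that are linearizations of $(E_C, \prec)$, where a string $\sigma = e_1 e_2 \ldots e_{\nu}$ is a linearization of $(E_C, \prec)$ if $\nu = |E_C|$ and $\forall e_{\lambda}, e_{\mu} \in E_C$ we have that: i) $e_{\lambda} = e_{\mu} \Rightarrow \lambda = \mu$ and ii) for $\lambda \neq \mu$, if $e_{\lambda} \prec e_{\mu}$ then $\lambda < \mu$.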
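The two conditions defining a linearization can be checked mechanically. The sketch below assumes the partial order is given as a set of pairs `(a, b)` meaning that `a` precedes `b`; the function name `is_linearization` and the example events are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def is_linearization(sigma: Sequence[str],
                     events: Set[str],
                     precedes: Iterable[Tuple[str, str]]) -> bool:
    """Check that sigma lists every event of the configuration exactly
    once and that a comes before b whenever a precedes b."""
    if len(sigma) != len(events) or set(sigma) != set(events):
        return False
    position = {e: i for i, e in enumerate(sigma)}
    return all(position[a] < position[b] for (a, b) in precedes)


# Hypothetical configuration: e1 precedes e3, e2 is concurrent with both.
EVENTS = {"e1", "e2", "e3"}
ORDER = {("e1", "e3")}

ok_1 = is_linearization(["e1", "e2", "e3"], EVENTS, ORDER)
ok_2 = is_linearization(["e2", "e1", "e3"], EVENTS, ORDER)
bad_order = is_linearization(["e3", "e1", "e2"], EVENTS, ORDER)
bad_repeat = is_linearization(["e1", "e1", "e3"], EVENTS, ORDER)
```

The concurrent event `e2` may appear anywhere in the string, which is exactly the interleaving that the partial-order representation avoids enumerating.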

2.3 Time Petri nets

A Time Petri Net (TPN) $\mathcal{N}^{\theta} = (P, T, F, I^s)$ consists of an (untimed) Petri Net $\mathcal{N} = (P, T, F)$ (called the untimed support of $\mathcal{N}^{\theta}$) and the static time interval function $I^s : T \to \mathcal{I}(\mathbb{Q}_+)$, $I^s_t = [L^s_t, U^s_t]$, $L^s_t, U^s_t \in \mathbb{Q}_+$, representing the set of all possible time delays associated to transition $t \in T$.

In a TPN $\langle \mathcal{N}^{\theta}, M_0^{\theta} \rangle$ we say that a transition $t$ becomes enabled at the time $\theta^{en}_t$; then the clock attached to $t$ is started and the transition $t$ can and must fire at some time $\theta_t \in [\theta^{en}_t + L^s_t, \theta^{en}_t + U^s_t]$, provided $t$ did not become disabled because of the firing of another transition. Notice that $t$ is forced to fire if it is still enabled at the time $\theta^{en}_t + U^s_t$.

Definition 7 A state at the time $\theta$ (according to a global clock) of a TPN $\langle \mathcal{N}^{\theta}, M_0^{\theta} \rangle$ is a pair $S = (M, FI)$ where $M$ is a marking and $FI$ is a firing interval function associated with each enabled transition in $M$ ($FI : T \to \mathcal{I}(\mathbb{Q}_+)$).

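If $t$ is executed at the time $\theta_t \in \mathbb{Q}_+$ we write $(M, FI) \xrightarrow{\langle t, \theta_t \rangle} (M', FI')$ or simply $S \xrightarrow{\langle t, \theta_t \rangle} S'$ where:

1. $(M \geq Pre(\cdot,t) \wedge \theta_t \geq \theta^{en}_t + L^s_t) \wedge (\forall t' \in T$ s.t. $M \geq Pre(\cdot,t') \Rightarrow \theta_t \leq \theta^{en}_{t'} + U^s_{t'})$

2. $M' = M - Pre(\cdot,t) + Post(t,\cdot)$

3. $\forall t'' \in T$ s.t. $M' \geq Pre(\cdot,t'')$ we have:

(a) if $t'' \neq t$ and $M \geq Pre(\cdot,t'')$ then $FI(t'') = [\max(\theta^{en}_{t''} + L^s_{t''}, \theta_t), \theta^{en}_{t''} + U^s_{t''}]$

(b) else $\theta^{en}_{t''} = \theta_t$ and $FI(t'') = [\theta^{en}_{t''} + L^s_{t''}, \theta^{en}_{t''} + U^s_{t''}]$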
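Condition 1 of the firing rule can be sketched as a small predicate: a transition may fire at a given time only if its own lower bound has elapsed and no enabled transition has already exceeded its upper bound. The sketch assumes that the enabling times of all currently enabled transitions are known; the names `can_fire_at`, `enabled_since` and `static`, and the two-transition example, are hypothetical.

```python
from typing import Dict, Tuple


def can_fire_at(t: str,
                theta: float,
                enabled_since: Dict[str, float],
                static: Dict[str, Tuple[float, float]]) -> bool:
    """Condition 1: t is enabled, theta >= en_t + L_t, and for every
    enabled transition u, theta <= en_u + U_u (u is not overdue)."""
    if t not in enabled_since:
        return False
    lower, _ = static[t]
    if theta < enabled_since[t] + lower:
        return False
    return all(theta <= enabled_since[u] + static[u][1]
               for u in enabled_since)


# Hypothetical state: t1 with static interval [1, 3] and t2 with [2, 4],
# both enabled at time 0.
STATIC = {"t1": (1, 3), "t2": (2, 4)}
ENABLED_SINCE = {"t1": 0, "t2": 0}

t1_at_2 = can_fire_at("t1", 2, ENABLED_SINCE, STATIC)
t2_at_3 = can_fire_at("t2", 3, ENABLED_SINCE, STATIC)
t2_at_4 = can_fire_at("t2", 4, ENABLED_SINCE, STATIC)
t1_at_0 = can_fire_at("t1", 0, ENABLED_SINCE, STATIC)
```

In this example `t2` cannot fire at time 4 even though 4 lies in its own interval, because `t1` would have been forced to fire by time 3.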

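A legal time trace $\tau^{\theta}$ in a TPN $\mathcal{N}^{\theta}$ satisfies: $\tau^{\theta} = S_0 \xrightarrow{\langle t_1, \theta_{t_1} \rangle} S_1 \xrightarrow{\langle t_2, \theta_{t_2} \rangle} \ldots \xrightarrow{\langle t_{\nu}, \theta_{t_{\nu}} \rangle} S_{\nu}$ where $S_{\lambda} \xrightarrow{\langle t_{\lambda+1}, \theta_{t_{\lambda+1}} \rangle} S_{\lambda+1}$ for $\lambda = 0, \ldots, \nu - 1$.

In the following, for a time trace $\tau^{\theta}$ we use the notation $\tau$ to denote its untimed support. For the initial state $S_0$ we use also the notation $M_0^{\theta}$. Denote by $\mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(M_0^{\theta})$ the set of all legal time traces that can be executed in $\langle \mathcal{N}^{\theta}, M_0^{\theta} \rangle$. We call $\mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(M_0^{\theta})$ the time language of the TPN $\langle \mathcal{N}^{\theta}, M_0^{\theta} \rangle$. $\mathcal{L}_{\mathcal{N}^{\theta}}(M_0^{\theta})$ is the untimed support language of the time language $\mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(M_0^{\theta})$, i.e. $\mathcal{L}_{\mathcal{N}^{\theta}}(M_0^{\theta}) = \{ \tau \mid \exists \tau^{\theta} \in \mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(M_0^{\theta}) \}$.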

3 Diagnosis of TPNs

We consider the following plant description:

1. the TPN model $\langle \mathcal{N}^{\theta}, M_0^{\theta} \rangle$ is untimed 1-safe

2. $T = T_o \cup T_{uo}$ where $T_o$ is the set of observable transitions and $T_{uo}$ is the set of unobservable transitions

3. $l_o$ is the observation labeling function $l_o : T \to \Omega_o \cup \{\epsilon\}$ where $\Omega_o$ is a set of labels and $\epsilon$ is the empty label. $l_o(t) = \epsilon$ if $t \in T_{uo}$ and $l_o(t) \in \Omega_o$ if $t \in T_o$

4. when an observable transition $t^o \in T_o$ is executed in the plant the label $l_o(t^o)$ is emitted together with the global time $\theta_{l_o(t^o)}$ when this execution of $t^o$ took place

5. the execution of an unobservable transition does not emit anything (is silent)

6. the faults are modeled by a subset of unobservable events, $T_f \subseteq T_{uo}$; $l_f : T_{uo} \to \Omega_f \cup \{\epsilon\}$ is the fault labeling function ($\Omega_f$ is a set of labels and $\epsilon$ is the empty label); $l_f(t) = \epsilon$ if $t \in T_{uo} \setminus T_f$ and $l_f(t) \in \Omega_f$ if $t \in T_f$

7. the faults are unpredictable, i.e. $\forall t \in T_f$, $\exists t' \in T \setminus T_f$ s.t. i) ${}^{\bullet}t' \subseteq {}^{\bullet}t$ and ii) $L^s_{t'} \leq U^s_t$.

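The plant observation at the time the $n$th observed event is executed in the plant is denoted as $\mathcal{O}_n = \langle obs_1, \theta_{obs_1} \rangle, \ldots, \langle obs_n, \theta_{obs_n} \rangle$, where $obs_1, \ldots, obs_n \in \Omega_o$ are the labels that are received and $\theta_{obs_1} \leq \theta_{obs_2} \leq \ldots \leq \theta_{obs_n}$ are the occurrence times of the corresponding events.

Denote by $\mathcal{O}_{n,\theta}$ the plant observation at the time $\theta > \theta_{obs_n}$, i.e. $\mathcal{O}_{n,\theta}$ includes also the information that no observable event occurred in the interval $[\theta_{obs_n}, \theta]$. $\mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(\mathcal{O}_n)$ is the set of all time traces that are feasible in $\langle \mathcal{N}^{\theta}, M_0^{\theta} \rangle$ up to the time of the last observation $\theta_{obs_n}$ and that obey the received observation $\mathcal{O}_n$, where $\tau^{\theta} \in \mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(\mathcal{O}_n)$ if: i) $\tau^{\theta} \in \mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(M_0^{\theta}, \theta_{obs_n})$ ($\tau^{\theta}$ is legal); ii) $l_o(\tau) = obs_1, \ldots, obs_n$ ($\tau$ obeys the "untimed" observation), and iii) for each observable transition $t^o_k \in T_o$, $k = 1, \ldots, n$, we have that $l_o(t^o_k) = obs_k \Rightarrow \theta_{t^o_k} = \theta_{obs_k}$ ($\tau^{\theta}$ obeys the execution times of the observed transitions).

Similarly $\mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(\mathcal{O}_{n,\theta})$ is the set of all time traces that are feasible in $\langle \mathcal{N}^{\theta}, M_0^{\theta} \rangle$ up to the time $\theta$ and that obey the received observation $\mathcal{O}_{n,\theta}$.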

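The plant diagnosis $\mathcal{D}_{\mathcal{N}^{\theta}}(\mathcal{O}_{n,\theta})$ based on the received observation $\mathcal{O}_{n,\theta}$ comprises the untimed strings obtained by projecting the untimed support traces contained in $\mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(\mathcal{O}_{n,\theta})$ onto the set of fault transitions $T_f$:
$\mathcal{D}_{\mathcal{N}^{\theta}}(\mathcal{O}_{n,\theta}) = \{ \sigma_f \mid \tau^{\theta} \in \mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(\mathcal{O}_{n,\theta}) \text{ and } \sigma_f = l_f(\tau) \}$

The diagnosis result $\mathcal{DR}_{\mathcal{N}^{\theta}}(\mathcal{O}_{n,\theta})$ indicates that a fault for sure happened if all the traces contain fault events, i.e.
$\mathcal{DR}_{\mathcal{N}^{\theta}}(\mathcal{O}_{n,\theta}) = \{F\} \Leftrightarrow \epsilon \notin \mathcal{D}_{\mathcal{N}^{\theta}}(\mathcal{O}_{n,\theta})$

If $\mathcal{D}_{\mathcal{N}^{\theta}}(\mathcal{O}_{n,\theta})$ contains only the empty string then the diagnosis result is normal, i.e.
$\mathcal{DR}_{\mathcal{N}^{\theta}}(\mathcal{O}_{n,\theta}) = \{N\} \Leftrightarrow \mathcal{D}_{\mathcal{N}^{\theta}}(\mathcal{O}_{n,\theta}) = \{\epsilon\}$.
Otherwise the diagnosis result is uncertain, i.e. a fault could have happened but did not necessarily happen [9].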
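The three-way classification of the diagnosis result can be written as a short function over the set of fault projections. In this sketch a projection is a tuple of fault labels and the empty tuple stands for the empty string; the return value `"U"` for the uncertain case and the function name `diagnosis_result` are assumptions made for the illustration (the text only names the results F and N).

```python
from typing import Set, Tuple

FaultString = Tuple[str, ...]


def diagnosis_result(projections: Set[FaultString]) -> str:
    """Return 'F' if every explanation contains a fault, 'N' if no
    explanation contains a fault, and 'U' (uncertain) otherwise."""
    if not projections:
        raise ValueError("the observation has no legal explanation")
    if () not in projections:
        return "F"
    if projections == {()}:
        return "N"
    return "U"


faulty = diagnosis_result({("f1",), ("f1", "f2")})
normal = diagnosis_result({()})
uncertain = diagnosis_result({(), ("f1",)})
```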

4 The analysis based on partial orders

The partial order reduction techniques developed for untimed PN [8] are shown in [1], [3] to be applicable for TPN. Consider a configuration $C$ in the unfolding $\mathcal{U}_{\mathcal{N}}(M_0)$ of the untimed PN support of a TPN. Then consider a valuation $\Theta$ of the execution times at which the events $e \in E_C$ in the configuration $C$ are executed, that is, for each $e \in E_C$ consider a time value $\theta_e \in TT$ ($TT$ the time axis) at which $e$ occurs, and $\Theta$ is an $|E_C|$-tuple representing the execution times of all the events $e \in E_C$.

An untimed configuration $C$ with a valuation $\Theta \in TT^{|E_C|}$ of the execution time for its events is called a time configuration of the TPN. A time configuration is legal if there is a legal trace $\tau^{\theta} \in \mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(M_0^{\theta})$ in the TPN $\langle \mathcal{N}^{\theta}, M_0^{\theta} \rangle$ whose untimed support $\tau$ is a linearization of the partial order relation of the events in the configuration (i.e. $\tau = \phi(\sigma)$ and $\sigma \in \langle E_C \rangle$) while the execution time $\theta_t$ of every transition $t$ considered in the trace $\tau^{\theta}$ is identical with the valuation $\theta_e$ of the event $e$ for which $t$ is its image via $\phi$.

Consider an untimed configuration $C \in \mathcal{C}$. The TPN $C^{\theta} = (B_C, E_C, \prec, \min(\mathcal{U}_{\mathcal{N}}), I^s)$ is obtained by attaching to each event $e \in E_C$ the static interval $I^s_t$ that corresponds in the original TPN to transition $t$ s.t. $\phi(e) = t$.

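Denote by $\tilde{K}_C$ the following system of inequalities:
$$\tilde{K}_C = \left\{ \max_{e' \in {}^{\bullet\bullet}e}(\theta_{e'}) + L^s_e \leq \theta_e \leq \max_{e' \in {}^{\bullet\bullet}e}(\theta_{e'}) + U^s_e \quad \forall e \in E_C \right. \qquad (1)$$
where in (1) ${}^{\bullet\bullet}e = \emptyset$ implies $\max_{e' \in {}^{\bullet\bullet}e}(\theta_{e'}) = 0$.

Proposition 1 $\forall \tau^{\theta} \in \mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(M_0^{\theta})$ we have that if $\tau = \phi(\sigma)$ and $\sigma \in \langle E_C \rangle$, then $\Theta$ is a solution of $\tilde{K}_C$, where $\Theta = (\theta_{t_1}, \ldots, \theta_{t_{|E_C|}}) = (\theta_{e_1}, \ldots, \theta_{e_{|E_C|}})$ with $\phi(e_i) = t_i$, $i = 1, \ldots, |E_C|$.
Proof: The proof is straightforward. □

Denote by $Sol(\tilde{K}_C)$ the set of all solutions of $\tilde{K}_C$. The $|E_C|$-hyperbox $\tilde{I}$ that circumscribes $Sol(\tilde{K}_C)$ is easily obtained as: $\forall e \in E_C$, $\tilde{I}(e) = [\tilde{L}(e), \tilde{U}(e)]$ with $\tilde{L}(e) = \max_{e' \in {}^{\bullet\bullet}e}(\tilde{L}(e')) + L^s_e$ and $\tilde{U}(e) = \max_{e' \in {}^{\bullet\bullet}e}(\tilde{U}(e')) + U^s_e$, where $\forall e \in E_C$ s.t. ${}^{\bullet\bullet}e = \emptyset$, $\tilde{L}(e) = L^s_e$ and $\tilde{U}(e) = U^s_e$.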
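The recursion for the circumscribing hyperbox is a single forward pass over the events in causal order. The sketch below assumes the events are already listed so that predecessors come first; the function name `circumscribing_hyperbox`, the event names and the static intervals are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[int, int]


def circumscribing_hyperbox(events_in_causal_order: Sequence[str],
                            pred_events: Dict[str, List[str]],
                            static: Dict[str, Interval]) -> Dict[str, Interval]:
    """For each event e: L(e) = max over predecessor events of L(e') + L^s_e
    and U(e) = max over predecessor events of U(e') + U^s_e, with the max
    taken as 0 when e has no predecessor event."""
    box: Dict[str, Interval] = {}
    for e in events_in_causal_order:
        preds = pred_events.get(e, [])
        base_low = max((box[p][0] for p in preds), default=0)
        base_up = max((box[p][1] for p in preds), default=0)
        low_s, up_s = static[e]
        box[e] = (base_low + low_s, base_up + up_s)
    return box


# Hypothetical configuration: ec follows ea; ed follows both ea and eb.
STATIC = {"ea": (2, 4), "eb": (1, 5), "ec": (2, 5), "ed": (1, 2)}
PREDS = {"ec": ["ea"], "ed": ["ea", "eb"]}

box = circumscribing_hyperbox(["ea", "eb", "ec", "ed"], PREDS, STATIC)
```

The hyperbox only circumscribes the solution set of the system (1); it does not yet account for conflicting events.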

Example 1 Consider the TPN displayed in Fig. 1. Static intervals are attached to each transition. The observable transitions are $t_4$, $t_7$ and $t_{10}$ and they emit the same label. $t_3$ and $t_9$ are faulty transitions.

In Fig. 2 a part of the unfolding $\mathcal{U}_{\mathcal{N}}(M_0)$ is displayed, where attached to each event $e \in E$ is the interval $\tilde{I}(e)$.

We cannot claim yet that for $C \in \mathcal{C}$ there exists at least one legal time configuration that corresponds with $C$, because for a general TPN the enabling of a transition does not guarantee that it eventually fires, since some conflicting transition may be forced to fire beforehand.

Figure 1: [TPN of Example 1 with places p1 to p11 and transitions t1 to t12, each transition labelled with its static interval; the observable transitions carry the intervals t4 [1,8], t7 [9,10] and t10 [2,9].]

Figure 2: [Part of the unfolding, with each event labelled by its interval; among others e3 [4,9], e6 [2,7], e9 [3,8], e8 [1,4], e1 [5,13], e12 [4,14], e4 [5,17], e7 [11,17], e10 [5,17].]

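Denote by $\check{E}_C$ the set of conflicting events of a configuration $C \in \mathcal{C}$, where $\check{E}_C$ comprises the events that could have been executed but are not included in $E_C$:
$\check{E}_C = \{ \check{e} \in E \setminus E_C \mid {}^{\bullet}\check{e} \subseteq B_C \}$

The characteristic system $K_C$ of configuration $C \in \mathcal{C}$ is obtained by adding to $\tilde{K}_C$ inequalities regarding the conflicting events:
$$K_C = \begin{cases} \max_{e' \in {}^{\bullet\bullet}e}(\theta_{e'}) + L^s_e \leq \theta_e \leq \max_{e' \in {}^{\bullet\bullet}e}(\theta_{e'}) + U^s_e & \forall e \in E_C \\ \min_{e' \sharp_1 \check{e}}(\theta_{e'}) \leq \max_{e'' \in {}^{\bullet\bullet}\check{e}}(\theta_{e''}) + U^s_{\check{e}} & \forall \check{e} \in \check{E}_C \end{cases}$$

Proposition 2 Given an arbitrary time $\theta$ we have that $\tau^{\theta} \in \mathcal{L}^{\theta}_{\mathcal{N}^{\theta}}(M_0^{\theta}, \theta)$ iff: i) $\tau = \phi(\sigma)$, $\sigma \in \langle E_C \rangle$ and $C \in \mathcal{C}$; ii) $\Theta$ is a solution of $K_C$; iii) $\forall e \in E_C \Rightarrow \theta_e \leq \theta$; and iv) $\forall \check{e} \in ENABLED(C)$, $\theta_{\check{e}} \geq \theta$.
Proof: ($\Rightarrow$) Since the PN is 1-safe we have that for any legal untimed trace $\tau$ there exists a unique configuration $C$ s.t. $\sigma \in \langle E_C \rangle$. Conditions 1, 3 and 4 are trivial and the proof that $\Theta = (\theta_{t_1}, \ldots, \theta_{t_n})$ is a solution of $K_C$ is simply by induction. ($\Leftarrow$) The proof is trivial. □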
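Checking whether a given valuation satisfies the characteristic system is straightforward, even though enumerating all solutions is not. The sketch below tests a candidate valuation against both groups of inequalities; the function name, the dictionary encoding and the one-event example are hypothetical, and every conflicting event is assumed to have at least one rival event inside the configuration.

```python
from typing import Dict, List, Tuple


def satisfies_characteristic_system(theta: Dict[str, float],
                                    pred: Dict[str, List[str]],
                                    static: Dict[str, Tuple[float, float]],
                                    conflicts: Dict[str, List[str]]) -> bool:
    """theta: execution time of each event of the configuration.
    pred: predecessor events of each event (also for conflicting events).
    static: static interval (L^s, U^s) of each event.
    conflicts: for each conflicting event, the events of the configuration
    in immediate conflict with it (assumed non-empty)."""
    def enabling_time(e: str) -> float:
        return max((theta[p] for p in pred.get(e, [])), default=0)

    # First group: each event fires inside its window after enabling.
    for e, t_e in theta.items():
        low_s, up_s = static[e]
        en = enabling_time(e)
        if not (en + low_s <= t_e <= en + up_s):
            return False
    # Second group: some rival fires before the conflicting event is forced.
    for c, rivals in conflicts.items():
        forced_at = enabling_time(c) + static[c][1]
        if min(theta[r] for r in rivals) > forced_at:
            return False
    return True


# Hypothetical case: e1 has static interval [1, 3]; the conflicting event x
# is enabled at time 0 and forced to fire by time 2.
STATIC = {"e1": (1, 3), "x": (0, 2)}
CONFLICTS = {"x": ["e1"]}

in_time = satisfies_characteristic_system({"e1": 2}, {}, STATIC, CONFLICTS)
too_late = satisfies_characteristic_system({"e1": 3}, {}, STATIC, CONFLICTS)
too_early = satisfies_characteristic_system({"e1": 0}, {}, STATIC, CONFLICTS)
```

The valuation with `e1` at time 3 lies inside the circumscribing hyperbox but is rejected by the conflict inequality, which is the situation the remark after Example 1 warns about.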

The problem that we should answer next is: "Up to what time to make the calculations for the on-line monitoring?".

There are different solutions to answer this question, depending on the computational capability, the plant behaviour, and the requirements for the diagnosis result.

Solution 1: Calculations in advance

The first solution is appropriate for a plant known to have a cyclic operation, where each operation cycle is initiated by the plant operator.

Having derived the plant behaviour up to the time that corresponds with the completion of an operation cycle, the plant is monitored on-line in the following way:

1. the received observation is taken into account by adding (in)equality constraints to the characteristic system of a configuration,

2. or configurations are discarded when the current time exceeds the latest execution time of an observable event in a configuration.

The main drawback of this method is that a large amount of calculations is performed in advance and then discarded because of the received observation.

Solution 2: Calculations after each observation

The second solution is to perform calculations each time an event is observed in the plant. E.g. when the first observable event is executed in the plant we derive the plant behaviour up to the time $\theta_{obs_1}$ in the following way.

Let the first observation be $\mathcal{O}_1 = \langle obs_1, \theta_{obs_1} \rangle$. Consider the set of configurations $\mathcal{C}(\mathcal{O}_1)$ s.t. $C \in \mathcal{C}(\mathcal{O}_1)$ if:

1. $E_C$ contains only one event $e^o$ s.t. $\phi(e^o) \in T_o$ and $l_o(\phi(e^o)) = obs_1$, and $\theta_{obs_1} \in \tilde{I}(e^o)$

2. $\forall e \in {}^{\bullet}CUT(C) : \tilde{L}(e) < \theta_{obs_1}$

3. $\forall e \in ENABLED(C) : \tilde{U}(e) > \theta_{obs_1}$

where $ENABLED(C)$ denotes the set of events that correspond to transitions that are enabled from $\phi(CUT(C))$.

The characteristic system $K_C(\mathcal{O}_1)$ of configuration $C \in \mathcal{C}(\mathcal{O}_1)$ is obtained by adding to $\tilde{K}_C$ inequalities regarding the conflicting events and the received observation.

This method requires less computation but the price to be paid is that a fault may be detected with a delay. This is because no calculations are performed until a new observation is received, thus the fact that the current time of the plant exceeds the latest execution time of an observable event is not taken into account.

However this method is practically useful when the frequency of observations is high, i.e. the time interval between two observations is short and control actions are inevitably taken with some latency. Moreover this method is also suitable when the plant observation is known to be uncertain, i.e. the observation of an event can be lost because of a sensor failure. This is because between two observations the diagnosis result w.r.t. the detection of the faults that for sure happened does not change if the observation is uncertain.

Solution 3: Calculations up to a discarding time

A discarding time is the earliest time when, in absence of any observation, one can discard untimed support traces because it can be proved that they are not valid. E.g. the first discarding time is the smallest latest execution time of an observable transition in the plant.
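As a rough illustration of the idea, a first estimate of the discarding time can be taken as the smallest upper bound among the observable events derived so far. This is only a sketch under the assumption that upper bounds of the kind $\tilde{U}(e)$ are already available for the events considered; the iterative procedure in the text refines such a value using the solution sets of the characteristic systems. The function name and the event names are hypothetical.

```python
from typing import Dict, Set


def first_discarding_time_estimate(upper_bound: Dict[str, float],
                                   observable: Set[str]) -> float:
    """Smallest latest execution time among the observable events;
    infinity when no observable event has been derived yet."""
    candidates = [upper_bound[e] for e in observable if e in upper_bound]
    return min(candidates) if candidates else float("inf")


# Hypothetical upper bounds: two observable events and one unobservable one.
UPPER = {"eo1": 17, "eo2": 12, "eu": 4}

estimate = first_discarding_time_estimate(UPPER, {"eo1", "eo2"})
no_observable = first_discarding_time_estimate(UPPER, set())
```

If no observation arrives by the returned time, the configurations whose observable event should already have fired can no longer explain the plant behaviour.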

Definition 8 A configuration $C \in \mathcal{C}$ is derived up to the time $\theta$ if: i) $\max_{e \in {}^{\bullet}CUT(C)}(\tilde{L}(e)) \leq \theta$ and ii) $\min_{e \in ENABLED(C)}(\tilde{U}(e)) > \theta$. Given a configuration $C \in \mathcal{C}$ that is derived up to a time $\theta'$, denote by $\mathcal{C}(\theta)$ the set of extensions of $C$ up to the time $\theta > \theta'$, where $C' \in \mathcal{C}(\theta)$ if: i) $C \sqsubseteq C'$ ($C'$ is a continuation of $C$) and ii) $C'$ is derived up to the time $\theta$.

The first discarding time $\theta$ is calculated iteratively as follows. $\theta$ is initiated with a big value (say $+\infty$ for simplicity) and then, starting from the initial configuration $C^{\perp} = (B^{\perp}, E^{\perp}, \prec_1)$, we construct an initial part of the net unfolding by appending events as in the untimed case, the only difference being that among all the enabled events, denoted by $ENABLED(C)$, only the events with the smallest upper limit $\tilde{U}(e)$ are appended, until the first observable event, say $e^o$, is encountered.

The discarding time is set equal to $\tilde{U}(e^o)$ and then the configurations that contain $e^o$ are extended up to the time $\tilde{U}(e^o)$. Denote this set by $\mathcal{C}^{new}_{obs}$. Then for each configuration $C \in \mathcal{C}^{new}_{obs}$ we calculate $Sol(K_C)$ and for those configurations that have a non-empty solution set we calculate $U(e'^o)$, i.e. the smallest latest time when an observable event $e'^o$ can be executed. Obviously $U(e'^o) \leq \tilde{U}(e^o)$.

The discarding time is set as the smallest latest time when an observable event can be forced to execute, considering all $C \in \mathcal{C}^{new}_{obs}$. Notice that a configuration $C$ may contain some other observable events and after calculating $Sol(K_C)$ some other observable event may have the smallest latest time for its execution. Then recursively all the configurations that contain only unobservable events are extended up to the new discarding time by appending event(s) selected among all the enabled events with the smallest upper limit $\tilde{U}(e)$. This operation continues until either a new observable event is encountered or no more events can be appended.

Notice that because $\theta$ is calculated recursively some configurations (that contain at least one observable event) are derived up to times bigger than $\theta$. However this does not affect the diagnosis result since the events that can be executed after the time $\theta$ are seen as a prognosis.

The on-line diagnosis algorithm works as follows. When the process starts we derive the set of configurations up to the first discarding time $\theta$ and then we have two cases:

Case 1 If no observation is received before the time $\theta$ then:

1. the configurations that contain observable events having the upper limit equal to $\theta$ are discarded

2. for all the other configurations that contain observable events, inequalities of the form
$K_{obs} = \{ \theta_{e^o} > \theta \mid e^o \in E_C \text{ and } \phi(e^o) \in T_o \}$
are added to the characteristic systems $K_C$ and we derive the entire solution set

3. for all the configurations $C \in \mathcal{C}_{uno}$ that contain only unobservable events we check only if $Sol(K_C)$ is non-empty

4. denote by $\mathcal{E}(\mathcal{O}_{0,\theta})$ the set of traces that are obtained as linearizations of the set of events of the configurations that are not discarded

5. the diagnosis $\mathcal{D}^{po}_{\mathcal{N}^{\theta}}(\mathcal{O}_{0,\theta})$ is obtained by projecting $\mathcal{E}(\mathcal{O}_{0,\theta})$ onto the set of fault transitions $T_f$

Case 2 If the first observation $\langle obs_1, \theta_{obs_1} \rangle$ is received before the time of the process becomes $\theta$ then:

1. the set of configurations $\mathcal{C}_{uno}$ that contain only unobservable events is discarded

2. for each configuration $C \in \mathcal{C}_{obs}$ that contains observable events an equality relation
$K'_{obs_1} = \{ \theta_{e^o} = \theta_{obs_1} \mid l_o(e^o) = obs_1 \wedge e^o \in C \}$
and, for observable events other than $e^o$, inequalities of the form
$K''_{obs_1} = \{ \theta_{e'^o} > \theta \mid e'^o \in E_C \text{ and } \phi(e'^o) \in T_o \}$
are added to the characteristic system $K_C$ and then we derive the entire solution set

3. denote by $\mathcal{E}(\mathcal{O}_1)$ the set of traces that are obtained as linearizations of the set of events of the configurations that are not discarded

4. $\mathcal{D}^{po}_{\mathcal{N}^{\theta}}(\mathcal{O}_1)$ is obtained by projecting $\mathcal{E}(\mathcal{O}_1)$ onto $T_f$

Notice that the plant diagnosis is derived either at the time of the first observed event, $\mathcal{D}^{po}_{\mathcal{N}^{\theta}}(\mathcal{O}_1)$, or in absence of any observation at the first discarding time, $\mathcal{D}^{po}_{\mathcal{N}^{\theta}}(\mathcal{O}_{0,\theta})$.

Theorem 1 Given a TPN model $\langle \mathcal{N}^{\theta}, M_0^{\theta} \rangle$ we have that:

1. when the first observable event is executed:
$\mathcal{DR}_{\mathcal{N}^{\theta}}(\mathcal{O}_1) = \{F\} \Leftrightarrow \mathcal{DR}^{po}_{\mathcal{N}^{\theta}}(\mathcal{O}_1) = \{F\}$

2. if no observation is received until the first discarding time:
$\mathcal{DR}_{\mathcal{N}^{\theta}}(\mathcal{O}_{0,\theta}) = \{F\} \Leftrightarrow \mathcal{DR}^{po}_{\mathcal{N}^{\theta}}(\mathcal{O}_{0,\theta}) = \{F\}$

3. and for any time $\theta$, in absence of any observation, the diagnosis result is different from $F$:
$\mathcal{DR}_{\mathcal{N}^{\theta}}(\mathcal{O}_{0,\theta}) \neq \{F\}$

Proof: (1) and (2) have a similar proof. Based on Proposition 2 we calculate the set of legal traces up to a given time $\theta$. However some configurations include events that are executed after the time $\theta_{obs_1}$ or $\theta$. Since the faults are unpredictable, the consideration of some events that can be executed after the time $\theta_{obs_1}$ or $\theta$ does not change the diagnosis result w.r.t. the detection of faults that for sure happened. (3) is proved straightforwardly by the assumption that the faults are unpredictable. □

Remark 1 Obviously, imposing the inequality that all the events in a configuration have execution times smaller than $\theta_{obs_1}$ or $\theta$ allows one to derive exactly the diagnosis result by removing the events that can be executed after the time $\theta_{obs_1}$, respectively $\theta$. However this is not efficient for practical calculations, especially when the frequency of observations is high. Notice also that calculations in advance are not fully developed, thus it may be that an event that is considered executed after $\theta_{obs_1}$ might not be executed since an event that is a successor of the observed event can pre-empt its execution.

In what follows we present two methods to derive the solution set of the characteristic system of a configuration. The first method is based on the ELCP and derives the entire solution set as a union of faces of a polyhedron that satisfy the cross-complementarity condition [10].

The second method is based on constraint propagation and derives for a configuration $C$ a set of $|E_C|$-hyperboxes s.t. the union of the subsets of solutions that are circumscribed by the $|E_C|$-hyperboxes is a cover of the entire solution set.

5 The method based on ELCP

The ELCP is defined as follows (see [10]). Given $A \in \mathbb{R}^{w \times z}$, $G \in \mathbb{R}^{q \times z}$, $c \in \mathbb{R}^w$, $d \in \mathbb{R}^q$, and $m$ index sets $\phi_1, \ldots, \phi_m \subseteq \{1, \ldots, w\}$, find $x \in \mathbb{R}^z$ such that
$$Ax \geq c, \quad Gx = d \qquad (2)$$
$$\sum_{j=1}^{m} \prod_{i \in \phi_j} (Ax - c)_i = 0. \qquad (3)$$

Condition (3) can be interpreted as follows. Since $Ax \geq c$, all the terms in (3) are nonnegative. Hence, (3) is equivalent to $\prod_{i \in \phi_j} (Ax - c)_i = 0$ for $j = 1, \ldots, m$. So we could say that each set $\phi_j$ corresponds to a group of inequalities in $Ax \geq c$, and that in each group at least one inequality should hold with equality. In [10] we have developed an algorithm to find all solutions of an ELCP. This algorithm yields a description of the complete solution set of an ELCP by finite points, generators for extreme rays, and a basis for the linear subspace associated with the maximal affine subspace of the solution set of the ELCP.

Let us now explain how $(\max,+)$ equations of the form
$$\max_{i \in J}(\theta_i) + L \leq \theta \leq \max_{i \in J}(\theta_i) + U \qquad (4)$$
can be recast as an ELCP. First we introduce a dummy variable $\gamma = \max_{i \in J} \theta_i$. Then (4) reduces to
$$\gamma + L \leq \theta \leq \gamma + U, \qquad (5)$$
which already fits the ELCP format. Let us now look at the equation $\gamma = \max_{i \in J} \theta_i$. This can be recast as
$$\gamma \geq \theta_i \text{ for all } i \in J, \qquad (6)$$
where for at least one index $i \in J$ equality should hold, i.e.,
$$\prod_{i \in J} (\gamma - \theta_i) = 0. \qquad (7)$$

Clearly, equations (5)–(7) constitute an ELCP. Thus $K_C$ can be treated as an ELCP. First we derive the polyhedron that provides the set of solutions for the system of linear (in)equalities given by (2). The solution set of the ELCP is obtained as a union of faces of a polyhedron that satisfy the cross-complementarity condition [10].
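The recasting of a single max-term can be checked numerically for a candidate point. The sketch below verifies conditions (5), (6) and (7) for given values of the event time, the dummy variable and the predecessor times; it is not the ELCP algorithm of [10], only a membership test. The function name and the numbers are hypothetical, exact integer arithmetic is assumed, and the index set $J$ is assumed non-empty.

```python
from typing import Sequence


def satisfies_max_recast(theta: int,
                         gamma: int,
                         pred_times: Sequence[int],
                         lower: int,
                         upper: int) -> bool:
    """Check (5) gamma + L <= theta <= gamma + U, (6) gamma >= theta_i for
    all i, and (7) the product of the slacks (gamma - theta_i) is zero."""
    window_ok = gamma + lower <= theta <= gamma + upper       # (5)
    slacks = [gamma - t for t in pred_times]
    dominance_ok = all(s >= 0 for s in slacks)                 # (6)
    product = 1
    for s in slacks:
        product *= s
    complementarity_ok = product == 0                          # (7)
    return window_ok and dominance_ok and complementarity_ok


# Hypothetical data: predecessor times 2 and 5, static interval [1, 3].
exact_max = satisfies_max_recast(7, 5, [2, 5], 1, 3)
gamma_too_big = satisfies_max_recast(7, 6, [2, 5], 1, 3)
gamma_too_small = satisfies_max_recast(7, 4, [2, 5], 1, 3)
```

Only the choice of the dummy variable equal to the true maximum passes all three conditions, which is what the complementarity condition (7) enforces.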

6 The method based on constraint propagation

Before formally presenting the second algorithm we first introduce the definition of a time interval configuration.

A time interval configuration $C(I)$ is an untimed configuration $C \in \mathcal{C}$ endowed with time intervals for the execution of the events within the configuration. $I$ is a vector of dimension $|E_C|$ that comprises for each event $e \in E_C$ the time interval $I(e)$ in which the event $e$ is assumed executed.

Definition 9 Given the observation $\mathcal{O}_1$ and a configuration $C \in \mathcal{C}(\mathcal{O}_1)$ we have that the time interval configuration $C(I)$ is legal if for any event $e_i$ ($\forall e_i \in E_C$) and for any execution time $\theta_{e_i}$ of the event $e_i$ ($\forall \theta_{e_i} \in I(e_i)$) there exist execution times for all the other events within the configuration ($\exists \theta_{e_j} \in I(e_j)$ for all $e_j \in E_C \setminus \{e_i\}$) s.t. $\Theta = (\theta_{e_1}, \ldots, \theta_{e_i}, \ldots, \theta_{e_{|E_C|}})$ is a solution of the characteristic system $K_C$ ($\Theta \in Sol(K_C)$).


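Given a hyperbox $I_{\nu} \subseteq \tilde{I}$ denote by $[L_{\nu}(e), U_{\nu}(e)]$ the execution time interval for the event $e$. Then for a conflicting event $\check{e}$ denote by $L_{\nu}(\check{e}) = \max_{e' \in {}^{\bullet\bullet}\check{e}}(L_{\nu}(e')) + U^s_{\check{e}}$ and $U_{\nu}(\check{e}) = \max_{e' \in {}^{\bullet\bullet}\check{e}}(U_{\nu}(e')) + U^s_{\check{e}}$ the earliest, respectively the latest, time when $\check{e}$ is forced to fire. We have the following.

Proposition 3 $C(I_{\nu})$ is a legal time interval configuration if the following conditions hold true:

1. $I_{\nu} \subseteq \tilde{I}$ such that $L_{\nu}(e) \leq \max_{e' \in {}^{\bullet\bullet}e}(L_{\nu}(e')) + U^s_e$ and $U_{\nu}(e) \geq \max_{e' \in {}^{\bullet\bullet}e}(U_{\nu}(e')) + L^s_e$

2. $\forall \check{e} \in \check{E}_C$, $\exists e \in E_C$ s.t. $e \sharp_1 \check{e}$ and $L_{\nu}(e) \leq L_{\nu}(\check{e})$ and $U_{\nu}(e) \leq U_{\nu}(\check{e})$

3. $\theta_{obs_1} = \theta_{e^o}$ for $e^o \in E_C$, $\phi(e^o) = l(obs_1)$

4. $\forall e \in {}^{\bullet}CUT(C) \Rightarrow U_{\nu}(e) \leq \theta_{obs_1}$

5. $\forall \check{e} \in ENABLED(C) \Rightarrow \max_{e' \in {}^{\bullet\bullet}\check{e}}(L_{\nu}(e')) + U^s_{\check{e}} \geq \theta_{obs_1}$.

Proof: The proof is lengthy and is omitted. □

In the following we present an algorithm that derives a set of $|E_C|$-hyperboxes, $\{I_{\nu} \mid \nu \in V\}$ ($V$ the set of indexes), s.t. for each $|E_C|$-hyperbox $I_{\nu}$, $C(I_{\nu})$ is a legal time interval configuration and the union of the subsets $\{Sol_{\nu}(K_C) \mid \nu \in V\}$ that are circumscribed by $I_{\nu}$ is a cover of the entire solution set $Sol(K_C)$, i.e. $\bigcup_{\nu \in V} Sol_{\nu}(K_C) = Sol(K_C)$, where $Sol_{\nu}(K_C) = Sol(K_C) \cap I_{\nu}$.

The idea behind the algorithm that we propose is as follows. First we calculate the hyperbox $\tilde{I}$ that circumscribes $Sol(\tilde{K}_C)$. Then we should impose the timing constraints required by conditions 2 to 5 in Proposition 3. We have three kinds of constraints. Denote by $K_{conf}$, $K'_{obs}$, and $K''_{obs}$ the set of constraints imposed by the set of conflicting events (condition (2)), the equality constraint required by the observation of the label $obs_1$ (condition (3)), and respectively the set of constraints that require that the time configuration is complete w.r.t. the time $\theta_{obs_1}$ (none of the concurrent parts of the process are left behind in time).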

Consider a constrainte on the time intervaleI(e) =[eL(e); eU(e) of an evente 2 EC where:e := nI 0(e) = [L0(e); U 0(e) j L0(e) > eL(e) orU 0(e) < eU (e)oThe set of solutions ofeKC that satisfye, denotedSol( eKC ^e), is obtained propagating the constrainte for-

ward to its successors and backwards to its predecessors:

- forward propagation:for all e 2 e:L0(e) = max(eL(e) + Lse ; eL(e)) andU 0(e) = min(eU(e) + Use ; eU(e))- backward propagation:

i) for all e 2 e:U 0(e) = min(eU(e) Lse; eU(e))ii) for eache 2 e s.t. eL(e) Use > eU(e)

consider a different case 2 V 0:ii.1) L0(e) = eL(e) Useii.2) for all e 2 e, e 6= e : L0(e) = eL(e).

Figure 3: The configuration C considered in Example 2 (events e1 to e12, each annotated with a time interval).

The backward propagation of a constrainte may requireto split anj EC j-hyperbox considering different cases. No-tice that the number of cases is not bigger than the number ofconcurrent predecessor events of the evente to whom the con-strainte is applied. For each hyperboxI0 , 0 2 V 0 the setof constraints is updated since in general it may be that newconstraints appear while some of the previous constraints aresatisfied. If a constraint cannot be imposed the case is abortedwhile if the set of constraints is empty the algorithm returnsan hyperbox that circumscribes a subset of solutions ofKC .

The constraint propagation algorithm works as follows:

1. the first step is to impose the constraints of kind $K'_{obs}$ and $K''_{obs}$ (required by the received observation)

2. the second step is to impose for eachj EC j-hyperboxthat results after step 1, the set of constraintsK onf .E.g. for I consider that9 e 2 EC s.t. condition2in Proposition 3 is not satisfied. Then for eache 2 ECs.t. e1e consider a different case and impose a con-strainte := fL00(e) = L0(e)g if L0(e) L0(e) ore = fU 00(e) = U(e)g if U0(e) U0(e).

3. an arbitrary constrainte or e is selected and then itis imposed backwards. If new constraints appear onthe time intervals of the predecessor events ofe or ethen one of these constraints is selected and it is im-posed further backwards until a decision is achieved.Then constraints are propagated forward for thej EC j-hyperboxes that are not aborted. The maximum numberof different cases that result propagating recursively aconstraint backwards is smaller than the size of maxi-mum set of concurrent events in the configuration

4. a decision is achieved for each case in finite time since the corner points of each $|E_C|$-hyperbox are rational numbers and each constraint that is applied either reduces one edge of the $|E_C|$-hyperbox or returns success/abort.

Example 2 Consider for the configurationC displayed inFig. 3 that the first observation is received at the time13and consider the case whene4 is the event that was observed.Let 0e4 = fe4 = 13g. 04 is propagated backwards anda new constraint0e3 appears where00e3 = fIe4 = [5; 9g.0e3 is propagated backwards but no new constraints appears.Then e10 is required to be executed aftere4 = 13, i.e.00e10 = f10 2 [13; 17g. 00e10 is propagated backwards and


a constrainte9 appears where00e9 = fIe9 = [4; 8g. 00e9 ispropagated backwards and no new constraint appears.

Then the timing constraints required by the conflictingeventse1 ande12 are satisfied. What is left is the conflictingevente6. We have thate3e6 ande9e6 and I(e3) = [5; 9,I(e9) = [4; 8, andI(e6) = [3; 7.

We have two cases. First considere3e6. We havee6 =L0e6 = 5 ande3 = U 0e3 = 7. e6 is propagated back-wards and we have two cases: eitherI1(e5) = [2; 5 andI1(e8) = [1; 4 or I2(e5) = [1; 5 and I2(e8) = [2; 4.e3 = U 0e3 = 7 does not produce new constraints. We ob-tain two hyperboxes and if we consider the case whene9e6we obtain in a similar way another two hyperboxes.

7 The on-line diagnosis

In the previous sections we have presented the plant diagnosis up to the first observation or, in the absence of any observation, up to the first discarding time. The on-line diagnosis is then performed by calculating the plant behaviour up to a new discarding time.

Theorem 2 Given a TPN modelhN ;M0 i we have that:

1. when an observable event is executed:DRN (On) = fFg , DRpoN (On) = fFg2. for the first discarding time after the time when thenth

observed event is reported:DRN (On;) = fFg , DRpoN (On;) = fFg3. and in absence of any observation, the diagnosis result

w.r.t. the detection of the faults that for sure happenedcalculated any time in between the last observed eventand the discarding time is constant, i.e.8 2 [obsn ; ):DRN (On;) = fFg , DRpoN (On) = fFg.

Proof: The proof is similar to the proof of Theorem 1. □

8 Final remarks and future work

We have derived in this paper on-line algorithms for the diagnosis of TPN models. The plant behaviour is derived up to a discarding time, i.e. up to a time when, in the absence of any observation, one can discard untimed support traces because they are not consistent with the plant behaviour. The analysis is based on partial orders and requires deriving the solution set of systems of (max,+)-linear inequalities.

We have presented two algorithms to derive the entire solution set, one based on the ELCP and the second one based on constraint propagation. Both algorithms address NP-hard problems. Besides the number of events, the number of conflicting events, and the maximum number of predecessors and successors of a node in a configuration, the computational complexity of both methods strongly depends on the structure of the system.

However, there are a few reasons that allow us to claim that the two methods are computationally more efficient than the ones ([1], [5]) presented in the literature. Compared with the method based on the state class graph computation [5], our methods have the advantage that not all the interleavings of the concurrent events are considered. Moreover, the computational complexity depends in our case on the size of the largest subnet that contains unobservable transitions, whereas the computational complexity in [5] depends on the size of the entire net. The algorithm in [1] solves a system of (max,+)-inequalities by enumerating all the cases for each max-term. This combinatorial approach is known in the literature to be computationally less efficient than the ELCP.

Finally, notice that for the above example the ELCP provides 8 subsets while constraint satisfaction only finds 4 subsets. The reason is that each face of a polyhedron that satisfies a cross-complementarity condition provides a legal time interval configuration but the converse is not true. The subset of solutions that is circumscribed by the hyperbox of a time interval configuration may be obtained as a union of faces of a polyhedron that satisfy a cross-complementarity condition.

However, the set of hyperboxes obtained by running the algorithm based on constraint propagation does not allow one to calculate the minimum and maximum time separation between the execution of two events unless a further refinement of the calculations is performed.

We plan to extend the methodology to a distributed setting in which the strong assumptions considered in [6] are relaxed.

References

[1] T. Aura and J. Lilius. Time processes of Time Petri Nets. ATPN'97 - LNCS, 1248, 1997.

[2] B. Berthomieu and M. Menasche. An enumerative approach for analyzing Time Petri Nets. IFIP Congress, Paris, 1983.

[3] T. Chatain and C. Jard. Time supervision of concurrent systems using symbolic unfoldings of Time Petri Nets. Int. Conf. on Formal Modeling and Analysis of Time Systems, Uppsala, Sweden, 2005.

[4] E. Fabre, A. Benveniste, S. Haar, and C. Jard. Distributed monitoring of concurrent and asynchronous systems. Journal of Discrete Event Dynamic Systems, 15(1):33–84, March 2005.

[5] M. Ghazel, M. Bigand, and A. Toguyeni. A temporal-constraint based approach for monitoring of Discrete Event Systems under partial observation. In IFAC Congress, Prague, 2005.

[6] G. Jiroveanu. Fault diagnosis for large Petri Nets. PhD thesis, Ghent University, Gent, Belgium, 2006.

[7] G. Jiroveanu, B. De Schutter, and R.K. Boel. Fault Diagnosis for Time Petri Nets. In Workshop on Discrete Event Systems (WODES'06), Ann Arbor, USA, 2006.

[8] K. L. McMillan. Using unfoldings to avoid the state space explosion problem in verification of asynchronous circuits. In 4th Int. Workshop on CAV, 1992.

[9] M. Sampath, R. Sengupta, S. Lafortune, S. Sinnamohideen, and D. Teneketzis. Diagnosability of Discrete Event Systems. IEEE-T on AC, 40(9), 1995.

[10] B. De Schutter and B. De Moor. The Extended Linear Complementarity Problem. Mathematical Programming, 71(3):289–325, Dec. 1995.


Primary and Secondary Plan Diagnosis∗

Femke de Jonge and Nico Roos and Cees Witteveen

Dept. of Computer Science, Universiteit Maastricht, P.O.Box 616, NL-6200 MD Maastricht

fax: +31-43-3884897, F.deJonge,[email protected]

EEMCS, Delft University of Technology, P.O.Box 5031, NL-2600 GA Delft

fax: +31-15-2786632, [email protected]

Abstract

Diagnosis of plan failures is an important subject in both single- and multi-agent planning. Plan diagnosis can improve the handling of plan failures in three ways: (i) it provides information necessary for the adjustment of the current plan or for the development of a new plan, (ii) it can be used to point out which equipment and/or agents should be repaired or adjusted so they will not further harm the plan execution, and (iii) it can identify the agents responsible for plan execution failures. We introduce two general types of plan diagnosis: primary plan diagnosis, which identifies the incorrect or failed execution of actions, and secondary plan diagnosis, which identifies the underlying causes of the faulty actions. Furthermore, three special cases of secondary diagnosis are distinguished, namely equipment diagnosis, environment diagnosis and agent diagnosis.

1 Introduction

In multi-agent planning research there is a tendency to deal with plans that become larger, more detailed and more complex. Clearly, as complexity grows, the vulnerability of plans to failures will grow correspondingly. Taking appropriate measures against such plan failures requires knowledge of their causes. So it is important to be able both to detect the occurrence of failures and to determine their causes. Therefore, we consider diagnosis an integral part of the capabilities of planning agents in single- and multi-agent systems.

In this paper we adapt and extend a classical Model-Based Diagnosis (MBD) approach to the diagnosis of plan execution. The system to be diagnosed consists not only of the plan and its execution, but also of the equipment needed for the execution, the environment in which the plan is executed and the executing agents themselves. Therefore, the agents, the equipment and the environment also need to be the subject of diagnosis.

∗ This research is supported by the Technology Foundation STW, applied science division of NWO and the technology programme of the Ministry of Economic Affairs (the Netherlands). Project DIT5780: Distributed Model Based Diagnosis and Repair.

To motivate the need for the different types of diagnosis we distinguish, consider a very simple example in which a pilot agent of an airplane participates in a larger multi-agent system for the Air Traffic Control of an airport. Suppose that the pilot agent is performing a landing procedure and that its plan prescribes the deployment of the landing gear. Unfortunately, the pilot was forced to make a belly landing. Clearly, the plan execution has failed and we wish to apply diagnosis to find out why. A first, but superficial, diagnosis will point out that the agent's action of deploying the landing gear has failed and that the fault mode of this action is "landing gear not locked". We will denote this type of diagnosis as primary plan diagnosis; this type of diagnosis focuses on a set of fault behaviors of actions that explain the differences between the expected and the observed plan execution.

Often, however, it is more interesting to determine the causes behind such faulty action executions. In our example, a faulty sensor may incorrectly indicate that the landing gear is already extended and locked, which led the pilot agent to the belief that the action was successfully executed. We will denote the diagnosis of these underlying causes as secondary plan diagnosis. Secondary diagnosis can be viewed as a diagnosis of the primary diagnosis. It informs us about malfunctioning equipment, unexpected environment changes (such as the weather) and faulty agents. As a special type of secondary diagnosis, we are also able to determine the agents responsible for the failed execution of some actions. In our example, the pilot agent might be responsible, but so might be the airplane maintenance agent.

In our opinion, diagnosis in general, and secondary diagnosis in particular, enables the agents involved to make specific adjustments to the system or the plan so as to manage current plan-execution failures and to avoid new plan-execution failures. These adjustments can be categorized with regard to their benefits to the general system. First of all, diagnosis provides information on how the plan behaves during execution, which might contribute to a failure-free (re)planning. For example, we can imagine that the initial knowledge of how a dynamic environment may influence the plan execution is rather limited. Diagnosis may provide information that expands the knowledge about plan execution. Secondly, a secondary diagnosis can point out which equipment used for the plan execution was malfunctioning. Broken equipment can then be fixed to improve future plan execution. Moreover, if the number of possible repairs is limited, diagnosis can indicate which repair has the most positive influence on future plan execution. In this respect, agents can be viewed in the same way as equipment: agents too can malfunction, either because of incorrect beliefs of the agent, or because the agent somehow died (crashed). Secondary diagnosis can also provide the information necessary to recover and adjust agents, thereby contributing to a better plan execution. Hence, it can contribute to solving the well known qualification problem [McCarthy, 1977]. Finally, diagnosis can indicate the agents responsible (accountable) for the failures in the plan execution. This information is very interesting when evaluating the system, and can also be used to divide costs of repairs and/or changes in the plan amongst the agents.

To realize the benefits of plan-based diagnosis outlined above, we introduce an object-oriented view to describe plan execution. Based on this model, primary and secondary diagnosis will be defined. The primary plan diagnosis more or less corresponds with the main aspects of diagnosis of plan execution described by Witteveen and Roos [Witteveen et al., 2005; Roos and Witteveen, 2005]. To enable us to apply secondary plan diagnosis, we expand their model such that it is not only possible to analyze the plan execution process, but also the role of the objects that influence the plan execution. The resulting model is specified in section 3 and consists of objects representing the plan and its execution, the equipment that is used for the plan execution, the environmental objects that are somehow involved in the plan, and the agent executing the plan. On this model, we can apply techniques inspired by model-based diagnosis to find the primary diagnosis, as described in subsection 4.1. The secondary diagnosis is presented in subsection 4.2, while subsection 4.3 discusses the agents that are held responsible for the failed actions. But first of all, we will place our approach into perspective by discussing some approaches to plan diagnosis in the following section.

2 Related research

In this section we briefly discuss some other approaches to plan diagnosis.

Birnbaum et al. [Birnbaum et al., 1990] apply MBD to planning agents, relating planning assumptions made by the agents to outcomes of their planning activities. However, they do not consider faults caused by execution failures as a separate source of errors.

de Jonge et al. [de Jonge and Roos, 2004; de Jonge et al., 2005] present an approach that directly applies model-based diagnosis to plan execution. Here, the authors focus on agents each having an individual plan, and on the conflicts that may arise between these plans (e.g. if they require the same resource). Diagnosis is applied to determine those factors that are accountable for future conflicts. The authors, however, do not take into account dependencies between health modes of actions and do not consider agents that collaborate to execute a common plan.

Kalech and Kaminka [Kalech and Kaminka, 2003; 2004] apply social diagnosis in order to find the cause of an anomalous plan execution. They consider hierarchical plans consisting of so-called behaviors. Such plans do not prescribe a (partial) execution order on a set of actions. Instead, based on its observations and beliefs, each agent chooses the appropriate behavior to be executed. Each behavior in turn may consist of primitive actions to be executed, or of a set of other behaviors to choose from. Social diagnosis then addresses the issue of determining what went wrong in the joint execution of such a plan by identifying the disagreeing agents and the causes for their selection of incompatible behaviors (e.g., belief disagreement, communication errors). Although we do not consider hierarchical plans of behaviors, social diagnosis is related to the agent diagnosis proposed here.

Lesser et al. [Carver and Lesser, 2003; Horling et al., 2001] also apply diagnosis to (multi-agent) plans. Their research concentrates on the use of a causal model that can help an agent to refine its initial diagnosis of a failing component (called a task) of a plan. As a consequence of using such a causal model, the agent would be able to generate a new, situation-specific plan that is better suited to pursue its goal. While their approach in its ultimate intentions (establishing anomalies in order to find a suitable plan repair) comes close to our approach, their approach to diagnosis concentrates on specifying the exact causes of the failing of one single component (task) of a plan. Diagnosis is based on observations of a component without taking into account the consequences of failures of such a component w.r.t. the remaining plan. In our approach, instead, we are interested in applying MBD-inspired methods to detect plan failures. Such failures are based on observations during plan execution and may concern individual components of the plan, but also agent properties. Furthermore, we do not only concentrate on failing components themselves, but also on the consequences of these failures for the future execution of plan elements.

Witteveen et al. [Witteveen et al., 2005; Roos and Witteveen, 2005] show how classical MBD can be applied to plan execution. To illustrate the different aspects of diagnosis discussed in the introduction, below we present an adapted and extended version of their formalization of plan diagnosis. This formalization enables the handling of the approaches of de Jonge et al. [de Jonge and Roos, 2004; de Jonge et al., 2005], Kalech and Kaminka [Kalech and Kaminka, 2003; 2004], and Lesser et al. [Carver and Lesser, 2003; Horling et al., 2001]. The work of Birnbaum et al. [Birnbaum et al., 1990] is not covered by the proposed formalization since it focuses on the planning activity instead of on plan execution.

3 Preliminaries

Objects In [Witteveen et al., 2005] it was shown that by using an object-oriented description of the world instead of a conventional state-based description, it becomes possible to apply classical MBD to plan execution. Here, we will take this approach one step further by also introducing objects for agents executing the plan and for the actions themselves. Hence, we assume a set of objects $O$ that will be used to describe the plan, the agents, the equipment and the environment.


The objects $O$ are partitioned into classes or types. We distinguish four general classes, namely: actions $\mathcal{A}$, agents $Ag$, equipment $\mathcal{E}$ and environment objects $\mathcal{N}$.

States and partial states Each object $o \in O$ is assumed to have a domain $D_o$ of values. The state of the objects $O = \{o_1, \ldots, o_n\}$ at some time point is described by a tuple $\sigma \in D_{o_1} \times \ldots \times D_{o_n}$ of values. In particular, the states $\sigma_{\mathcal{A}}$, $\sigma_{Ag}$, $\sigma_{\mathcal{E}}$ and $\sigma_{\mathcal{N}}$ are used to denote the state of the action objects $\mathcal{A}$, the agent objects $Ag$, the equipment objects $\mathcal{E}$ and the environment objects $\mathcal{N}$, respectively.

The state $\sigma_{\mathcal{N}}$ of environment objects $\mathcal{N}$ describes the state of the agents' environment at some point in time. Such a state description can include, for example, the location of an airplane or the availability of a gate.

The states $\sigma_{\mathcal{A}}$, $\sigma_{Ag}$ and $\sigma_{\mathcal{E}}$ of action, agent and equipment objects, respectively, describe the health modes of these objects for the purpose of diagnosis [Kleer and Williams, 1989; Struss and Dressler, 1989]. We assume that each of their corresponding domains contains at least (i) the value nor to denote that the action, agent and equipment objects behave normally, and (ii) the general fault mode ab to denote that the action, agent and equipment objects behave in an unknown and possibly abnormal way. Moreover, the domains may contain several more specific fault modes. For instance, the domain of a 'flight' action may contain a fault mode indicating that the flight is 20 minutes delayed.¹

It will not always be possible to give a complete state description. Therefore, we introduce a partial state as an element $\pi \in D_{o_{i_1}} \times D_{o_{i_2}} \times \ldots \times D_{o_{i_k}}$, where $1 \le k \le n$ and $1 \le i_1 < \ldots < i_k \le |O|$. We use $O(\pi)$ to denote the set of objects $\{o_{i_1}, o_{i_2}, \ldots, o_{i_k}\} \subseteq O$ specified in such a state $\pi$. The value of an object $o \in O(\pi)$ in $\pi$ will be denoted by $\pi(o)$. The value of an object $o \in O$ not occurring in a partial state $\pi$ is said to be unknown (or unpredictable) in $\pi$, denoted by $\bot$. Including $\bot$ in every value domain $D_i$ allows us to consider every partial state $\pi$ as an element of $D_1 \times D_2 \times \ldots \times D_{|O|}$.

Partial states can be ordered with respect to their information content: given values $d$ and $d'$, we say that $d \le d'$ holds iff $d = \bot$ or $d = d'$. The containment relation $\sqsubseteq$ between partial states is the point-wise extension of $\le$: $\pi$ is said to be contained in $\pi'$, denoted by $\pi \sqsubseteq \pi'$, iff $\forall o \in O\,[\pi(o) \le \pi'(o)]$. Given a subset of objects $S \subseteq O$, two partial states $\pi, \pi'$ are said to be $S$-equivalent, denoted by $\pi =_S \pi'$, if for every $o \in S$, $\pi(o) = \pi'(o)$. We define the partial state $\pi$ restricted to a given set $S$, denoted by $\pi \upharpoonright S$, as the state $\pi' \sqsubseteq \pi$ such that $O(\pi') = S \cap O(\pi)$.

¹ Note that in a more elaborate approach the value of, for instance, an equipment object may also indicate the location of the equipment. In this paper we only represent the health mode of the equipment.

An important notion for our approach to diagnosis is the compatibility relation between partial states. Intuitively, two states $\pi$ and $\pi'$ are said to be compatible if they could refer to the same complete state. This means that they do not disagree on the values of objects defined in both states, i.e., for every $o \in O$ either $\pi(o) = \pi'(o)$ or at least one of the values $\pi(o)$ and $\pi'(o)$ is undefined. So we define $\pi$ and $\pi'$ to be compatible, denoted by $\pi \approx \pi'$, iff $\forall o \in O\,[\pi(o) \le \pi'(o) \text{ or } \pi'(o) \le \pi(o)]$. As an easy consequence we have, using the notion of $S$-equivalent states, $\pi \approx \pi'$ iff $\pi =_{O(\pi) \cap O(\pi')} \pi'$. Finally, if $\pi$ and $\pi'$ are compatible states, they can be merged into the $\sqsubseteq$-least state $\pi \sqcup \pi'$ containing them both: $\forall o \in O\,[(\pi \sqcup \pi')(o) = \max_{\le}\{\pi(o), \pi'(o)\}]$.
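The relations on partial states introduced above can be made concrete with a small sketch. The encoding is an assumption made only for illustration: a partial state is a Python dict from object names to values, and an object missing from the dict has the unknown value ⊥. The function names (`contained_in`, `compatible`, `merge`, `restrict`) are hypothetical and do not come from the paper.

```python
def contained_in(pi, pi2):
    """Containment: every value defined in pi is defined identically in pi2."""
    return all(o in pi2 and pi2[o] == v for o, v in pi.items())

def compatible(pi, pi2):
    """Compatibility: the states agree on every object defined in both."""
    return all(pi[o] == pi2[o] for o in pi.keys() & pi2.keys())

def merge(pi, pi2):
    """Least state containing both pi and pi2 (only defined if compatible)."""
    if not compatible(pi, pi2):
        raise ValueError("incompatible partial states cannot be merged")
    return {**pi, **pi2}

def restrict(pi, objects):
    """Restriction of pi to the objects of a set S that are defined in pi."""
    return {o: v for o, v in pi.items() if o in objects}

# Two partial views of the same situation (object names are illustrative).
a = {"truck": "nor", "goods": "Rotterdam"}
b = {"goods": "Rotterdam", "agent": "nor"}
merged = merge(a, b)
```

Representing ⊥ by absence keeps containment and compatibility down to comparisons on the shared keys, which mirrors the observation that two states are compatible exactly when they are equivalent on the objects defined in both.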

Goals An (elementary) goal $g$ of an agent specifies a set of states the agent wants to bring about using a plan. Here, we specify each such goal $g$ as a constraint, that is, a relation over some product $D_{i_1} \times \ldots \times D_{i_k}$ of domains. We say that a goal $g$ is satisfied by a partial state $\pi$, denoted by $\pi \models g$, if the relation $g$ contains some tuple (partial state) $(d_{i_1}, d_{i_2}, \ldots, d_{i_k})$ such that $(d_{i_1}, d_{i_2}, \ldots, d_{i_k}) \sqsubseteq \pi$. We assume each agent $a$ to have a set $G_a$ of such elementary goals $g \in G_a$. We use $\pi \models G_a$ to denote that all goals in $G_a$ hold in $\pi$, i.e. for all $g \in G_a$, $\pi \models g$.
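Goal satisfaction can be sketched in the same style. The encoding is again an assumption for illustration only: a goal is a list of admissible tuples, each written as a dict over the constrained objects, and a partial state is a dict in which missing objects are unknown. The names `satisfies` and `satisfies_all` are hypothetical.

```python
def contained_in(small, big):
    """True if every value defined in `small` is defined identically in `big`."""
    return all(o in big and big[o] == v for o, v in small.items())

def satisfies(pi, goal):
    """pi satisfies g if some tuple of the relation g is contained in pi."""
    return any(contained_in(t, pi) for t in goal)

def satisfies_all(pi, goals):
    """pi satisfies a goal set G_a if it satisfies every goal in it."""
    return all(satisfies(pi, g) for g in goals)

# Hypothetical goal: the goods are in New York or in Boston.
delivered = [{"goods": "New York"}, {"goods": "Boston"}]
state = {"goods": "New York", "truck": "nor"}
```

Note that a state in which the constrained object is unknown does not satisfy the goal, since no tuple of the relation is contained in it.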

Actions and action execution The set $\mathcal{A}$ of action objects, also called plan steps, is partitioned into subclasses $\alpha_i$ called action types or plan operators. Through the execution of a specific action object $a \in \mathcal{A}$, the state of environment objects $\mathcal{N}$ and possibly also of equipment objects $\mathcal{E}$ may change. We describe such changes of an action object $a \in \mathcal{A}$ by a (partial) function $f_\alpha$, where $\alpha$ is the type of the action (plan operator) of which $a$ is an instance:

$$f_\alpha : D_a \times D_{ag} \times D_{e_1} \times \ldots \times D_{e_i} \times D_{n_1} \times \ldots \times D_{n_j} \to D_{e'_1} \times \ldots \times D_{e'_k} \times D_{n'_1} \times \ldots \times D_{n'_l}$$

where $a \in \alpha \subset \mathcal{A}$ is an action of type $\alpha$, $ag \in Ag$ is the executing agent, $e_1, \ldots, e_i \in \mathcal{E}$ are the required equipment objects, $n_1, \ldots, n_j \in \mathcal{N}$ are the required environment objects, and $\{e'_1, \ldots, e'_k, n'_1, \ldots, n'_l\} \subseteq \{e_1, \ldots, e_i, n_1, \ldots, n_j\}$ are the equipment and environment objects that are changed by the action $a$. Note that since the values of equipment objects only indicate health modes of these objects, we allow for equipment objects in the range of $f_\alpha$ in order to be able to describe repair and maintenance actions.

To distinguish the different types of parameters more clearly, semicolons will be placed between them when they appear in the argument of a function, e.g.:

$f_{transport}(driving : \mathcal{A};\ hal : Ag;\ truck : \mathcal{E};\ goods : \mathcal{N})$.

The objects whose value domains occur in $dom(f_\alpha)$ will be denoted by

$$dom_O(o_a) = \{o_a, o_{ag}, o_{e_1}, \ldots, o_{e_i}, o_{n_1}, \ldots, o_{n_j}\}$$

and, analogously,

$$ran_O(o_a) = \{o_{e'_1}, \ldots, o_{e'_k}, o_{n'_1}, \ldots, o_{n'_l}\}.$$

Moreover, we will use $dom^{Ag}_O(o_a)$, $dom^{\mathcal{E}}_O(o_a)$ and $dom^{\mathcal{N}}_O(o_a)$ to denote $\{o_{ag}\}$, $\{o_{e_1}, \ldots, o_{e_i}\}$ and $\{o_{n_1}, \ldots, o_{n_j}\}$, respectively. Note that we use the action instance $o_a$ to denote the objects involved in the execution of $o_a$ according to the function $f_\alpha$ with $o_a \in \alpha$.

The result of an action may not always be known if, for instance, the action fails or if equipment is malfunctioning. Therefore we allow the function associated with an action to map the value of an object $o$ to $\bot$ to denote that the effect of the action on $o$ is unknown. In fact, we only require that the effect of an action is completely specified for all objects in the function's range if the action is executed in normal circumstances. That is, the agent is capable of executing the planned action given the planned equipment.

Figure 1: An action and its state transformation. (The figure shows $f_{transport}$ relating two states over the objects goods, truck, agent and driving; the goods location is Rotterdam in one state and New York in the other, and the status of truck, agent and driving is normal in both.)

Figure 1 gives an illustration of the state transformation outlined above as the result of the application of a drive action. Note that in this example only the state of the goods is changed as the result of the transport action. In the paper we assume that an object representing equipment only indicates the health state of the equipment. In a more elaborate approach, the health state is only one of the attributes of the object. Another attribute may indicate the location of the object. In Figure 1, the location of the truck would then also change as a result of the drive action.

Plans A plan is a tuple $P = \langle A, < \rangle$ where $A \subseteq \mathcal{A}$ is a subset of plan steps (action objects) that need to be executed and $<$ is a partial order defined on $A \times A$, where $a < a'$ indicates that the plan step $a$ must finish before the plan step $a'$ may start. Note that each plan step $a \in A$ occurs exactly once in the plan $P$, while there may be several plan steps that belong to the same action type. We will denote the transitive reduction of $<$ by $\ll$, i.e., $\ll$ is the smallest sub-relation of $<$ such that the transitive closure $\ll^+$ of $\ll$ equals $<$.

We assume that if in a plan $P$ two action instances $a$ and $a'$ are independent, in principle they may be executed concurrently. This means that the precedence relation $<$ should at least capture all resource dependencies that would prohibit concurrent execution of actions. Therefore, we assume $<$ to satisfy the following concurrency requirement:

If $ran_O(a) \cap dom_O(a') \ne \emptyset$ then $a < a'$ or $a' < a$.

That is, for concurrent instances, domains and ranges do not overlap.²

² Note that since $ran_O(a) \subseteq dom_O(a)$, this requirement excludes overlapping ranges of concurrent actions, but domains of concurrent actions are allowed to overlap as long as the values of the objects in the overlapping domains are not affected by the actions.
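The concurrency requirement can be checked mechanically. The following is a minimal sketch under assumed encodings: `dom` and `ran` map each action name to a set of object names, and `before` is the precedence relation given as a set of ordered pairs (assumed to be transitively closed). The function name and the sample data are hypothetical and are not taken from Figure 2.

```python
from itertools import permutations

def concurrency_violations(dom, ran, before):
    """Return ordered pairs (a, b) with ran(a) ∩ dom(b) nonempty
    for which neither a < b nor b < a holds."""
    return [
        (a, b)
        for a, b in permutations(dom, 2)
        if ran[a] & dom[b]
        and (a, b) not in before
        and (b, a) not in before
    ]

# Two unordered actions whose domains overlap on o3 but whose ranges do not
# touch the shared object: allowed by the requirement.
dom = {"a5": {"o2", "o3"}, "a6": {"o3", "o4"}}
ran_ok = {"a5": {"o2"}, "a6": {"o4"}}
# Here a5 writes o3, which a6 reads, so the two actions must be ordered.
ran_bad = {"a5": {"o3"}, "a6": {"o4"}}
```

With `ran_ok` the check returns no violations, whereas with `ran_bad` it reports the pair `("a5", "a6")` until an ordering between the two actions is added to `before`.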

Figure 2: Plans and action instances. Each state characterizes the values of four objects $o_1$, $o_2$, $o_3$ and $o_4$. States are changed by application of action instances. (The figure shows the action instances $a_1$ to $a_6$ and the states $\pi_0$ to $\pi_3$.)

Example Figure 2 gives an illustration of a plan. Since an action object is applied only once in a plan, for reasons of clarity we replace the function describing the behavior of the action by the name of the action. The arrows relate an action to the objects it uses as inputs and the objects it modifies as its outputs. In this plan, the dependency relation is specified as $a_1 \ll a_3$, $a_2 \ll a_4$, $a_4 \ll a_5$, $a_4 \ll a_6$ and $a_1 \ll a_5$. Note that the last dependency has to be included because $a_5$ changes the value of $o_2$ needed by $a_1$. The action $a_1$ shows that not every object occurring in the domain of an action needs to be affected by the action. The actions $a_5$ and $a_6$ illustrate that concurrent actions may have overlapping domains.

Plan execution For simplicity, we will assume that every action in a plan $P$ takes a unit of time to execute. We are allowed to observe the execution of a plan $P$ at discrete times $t = 0, 1, 2, \ldots, k$ where $k$ is the depth of the plan, i.e., the length of the longest $<$-chain of actions occurring in $P$. Let $depth_P(a)$ be the depth of action $a$ in plan $P = \langle A, < \rangle$. Here, $depth_P(a) = 0$ if $\{a' \mid a' \ll a\} = \emptyset$ and $depth_P(a) = 1 + \max\{depth_P(a') \mid a' \ll a\}$, otherwise. If the context is clear, we will often omit the subscript $P$. We assume that the plan starts to be executed at time $t = 0$ and that concurrency is fully exploited, i.e., if $depth_P(a) = k$, then execution of $a$ has been completed at time $t = k + 1$. Thus, all actions $a$ with $depth_P(a) = 0$ are completed at time $t = 1$ and every action $a$ with $depth_P(a) = k$ will be started at time $k$ and will be completed at time $k + 1$. Note that thanks to the concurrency requirement specified above, concurrent execution of actions having the same depth leads to a well-defined result.
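The recursive definition of depth translates directly into code. The sketch below assumes that the direct-predecessor relation is given as a dict `preds` mapping each action to the set of its direct predecessors; the function name `depths` is hypothetical. The sample data encodes the five dependencies listed in the example for Figure 2.

```python
from functools import lru_cache

def depths(actions, preds):
    """Depth of every action: 0 without predecessors,
    otherwise 1 + the maximum depth of its direct predecessors."""
    @lru_cache(maxsize=None)
    def depth(a):
        ps = preds.get(a, ())
        return 0 if not ps else 1 + max(depth(p) for p in ps)
    return {a: depth(a) for a in actions}

# Direct predecessors from the example: a1 before a3, a2 before a4,
# a4 before a5, a4 before a6, and a1 before a5.
preds = {"a3": {"a1"}, "a4": {"a2"}, "a5": {"a4", "a1"}, "a6": {"a4"}}
d = depths(["a1", "a2", "a3", "a4", "a5", "a6"], preds)

# Group the actions into layers P_t of equal depth, i.e. the sets of
# actions that are started at time t.
layers = {}
for a, t in d.items():
    layers.setdefault(t, []).append(a)
```

Under the unit-duration assumption, the actions in `layers[t]` start at time t and are completed at time t + 1.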

A timed state is a tuple $(\pi, t)$ where $\pi$ is a state and $t \ge 0$ a time point. We would like to consider the predicted effect (timed state) $(\pi', t')$ as the result of executing plan $P$ on a given timed state $(\pi, t)$. To define this relation in a precise way, we will need the following concepts. First of all, let $P_t$ denote the set of actions $a$ with $depth_P(a) = t$, and let $P_{>t} = \bigcup_{t' > t} P_{t'}$, $P_{<t} = \bigcup_{t' < t} P_{t'}$ and $P_{[t,t']} = \bigcup_{k=t}^{t'} P_k$. Secondly, we say that a plan step $a$ is enabled in a state $\pi$ if $dom_O(a) \subseteq O(\pi)$.

Now we can predict the timed state $(\pi', t+1)$ using the timed state $(\pi, t)$ and the set $P_t$ of plan steps to be executed as follows:

1. whenever an object $o$ does not occur in the range of an action $a \in P_t$, its value in state $\pi'$ is the same as its value in $\pi$, i.e., $\pi'(o) = \pi(o)$;

2. if the object $o$ occurs in the range of an action $a \in P_t$ that is enabled in $\pi$, its value changes according to the function specification, i.e., $\pi'(o) = f_\alpha(\pi \upharpoonright dom_O(a))(o)$.

Formally, we say that $(\pi', t+1)$ is (directly) generated by execution of $P$ from $(\pi, t)$, abbreviated by $(\pi, t) \to_P (\pi', t+1)$, iff the following conditions hold:

1. $\pi'(o) = f_\alpha(\pi \upharpoonright dom_O(a))(o)$ for each $a \in P_t$ such that $a \in \alpha$ and for each $o \in ran_O(a)$.

2. $\pi'(o) = \pi(o)$ for each $o \notin ran_O(P_t)$, that is, the value of any object not occurring in the range of an action in $P_t$ should remain unchanged. Here, $ran_O(P_t)$ is a shorthand for the union of the sets $ran_O(a)$ with $a \in P_t$.

For arbitrary values of $t \leq t'$ we say that $(\pi', t')$ is (directly or indirectly) generated by execution of $P$ from $(\pi, t)$, denoted by $(\pi, t) \to^*_P (\pi', t')$, iff the following conditions hold:

1. if $t = t'$ then $\pi' = \pi$;

2. if $t' = t + 1$ then $(\pi, t) \to_P (\pi', t')$;

3. if $t' > t + 1$ then there must exist some state $(\pi'', t' - 1)$ such that $(\pi, t) \to^*_P (\pi'', t' - 1)$ and $(\pi'', t' - 1) \to_P (\pi', t')$.
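The direct and indirect generation relations can be illustrated with a small Python sketch. It rests on assumptions that are not part of the paper: partial states are dictionaries, each action is a hypothetical `(dom, ran, f)` triple, and the range of an action that is not enabled is treated as undefined in the next state.

```python
def step(state, layer):
    """One step (pi, t) ->_P (pi', t + 1) for the actions in P_t.

    state: dict mapping object names to values (a partial state).
    layer: list of (dom, ran, f) triples; f maps a dict over dom
           to a dict over ran. This encoding is hypothetical.
    """
    touched = {o for _, ran, _ in layer for o in ran}
    # Condition 2: objects outside every range keep their value.
    new_state = {o: v for o, v in state.items() if o not in touched}
    for dom, ran, f in layer:
        if all(o in state for o in dom):             # a is enabled in pi
            outputs = f({o: state[o] for o in dom})  # f on pi restricted to dom
            for o in ran:
                new_state[o] = outputs[o]            # Condition 1
        # Assumption: the range of a non-enabled action stays undefined.
    return new_state


def run(state, layers_by_time, t, t_end):
    """Iterate step() to obtain (pi, t) ->*_P (pi', t_end)."""
    for time in range(t, t_end):
        state = step(state, layers_by_time.get(time, []))
    return state


# Hypothetical two-layer plan: y := x + 1 at depth 0, w := y + z at depth 1.
layers_by_time = {
    0: [(("x",), ("y",), lambda s: {"y": s["x"] + 1})],
    1: [(("y", "z"), ("w",), lambda s: {"w": s["y"] + s["z"]})],
}
full = run({"x": 1, "z": 10}, layers_by_time, 0, 2)
partial = run({"x": 1}, layers_by_time, 0, 2)
```

In the second call the object `z` is unknown, so the depth-1 action is not enabled and `w` remains unpredicted, which mirrors how partial states propagate in the definitions.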

3.1 Normality assumptions

In the above subsection, we defined the (expected) result of a plan execution given the known states of several objects. In general, we do not know the state of every object. More particularly, we do not know the health mode of the objects affected by an action unless we can directly verify the effect of the action execution. More generally, the results of plan execution are uncertain since we need not know the health mode of actions, agents and equipment. Therefore, to predict the effect of a plan execution, we must make assumptions about the (health) state of actions, agents and equipment. We will simply assume that actions, agents and equipment are in the state nor, unless we have information stating otherwise. Hence, to a given partial state $\pi$ we add a set of default assumptions $\delta$ specifying for actions, agents and equipment that they are executed or behaving normally.

Equivalently, such a set of assumptions $\delta$ associated with $\pi$ specifies a partial state $\pi_\delta$ such that:

• $O(\pi_\delta) \subseteq O - O(\pi)$,

• for each $o \in O(\pi_\delta)$: $\pi_\delta(o) = nor$.

Note that normally, in the absence of actions that can sabotage equipment, the status of the equipment objects in $O(\pi_\delta) \cap E$ will not change during plan execution.

Using these assumptions $\delta$, we can define the result of a normal execution of a plan $P$ by extending the initial partial state $\pi$ at time point $t = 0$ with the state $\pi_\delta$ and then considering the timed state $(\pi', t')$ as the result of executing $P$ on the timed state $(\pi \sqcup \pi_\delta, 0)$. That is, $(\pi', t')$ is the result of normal plan execution on $(\pi, 0)$ iff $(\pi \sqcup \pi_\delta, 0) \to^*_P (\pi', t')$.
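Extending a partial state with the default assumptions amounts to a dictionary merge in which observed values take precedence. A minimal sketch, assuming dictionary-encoded states and a hypothetical list of health objects:

```python
NOR = "nor"


def extend_with_defaults(state, health_objects):
    """Return pi joined with pi_delta: `nor` for every health object not in pi."""
    defaults = {o: NOR for o in health_objects if o not in state}
    return {**state, **defaults}


# Hypothetical observation: the truck is known to have a flat tyre,
# while nothing is known about action a1 or the pilot.
observed = {"o1": 5, "truck": "flat_tyre"}
extended = extend_with_defaults(observed, ["a1", "pilot", "truck"])
```

Only objects outside the observed state receive the value `nor`, which corresponds to the condition that the objects of the default state are disjoint from those of the observed state.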

4 Plan diagnosis

By making (partial) observations at different time points of the ongoing plan execution we may establish that there are discrepancies between the expected and the observed execution of the plan. These discrepancies indicate that the results of executing one or more actions differ from the way they were planned. Identifying these actions and, if possible, what went wrong in the actions' execution will be called primary plan diagnosis. Actions may fail because of external factors such as changes in the environmental conditions (the weather), failing equipment or incorrect beliefs of agents. These external factors are underlying causes which are important for predicting how the remainder of a plan will be executed. The secondary plan diagnosis aims at establishing these underlying causes.

4.1 Primary plan diagnosis

In [Witteveen et al., 2005; Roos and Witteveen, 2005], Witteveen et al. describe how plan execution can be diagnosed by viewing action instances of a plan as components of a system and by viewing the input and output objects of an action as inputs and outputs of a component. This made it possible to apply classical MBD to plan execution. Here, we will use a modified version of the plan diagnosis proposed by Witteveen et al.

Since a plan $P = \langle A, < \rangle$ is a partial order, actions (plan steps) in $A$ are executed only once. Therefore, we could define a primary diagnosis in which the execution of some actions may fail, using the set of default assumptions $\delta$. However, for other types of diagnosis, such as diagnosis of equipment, such an approach does not suffice. One of the reasons is that, e.g., equipment may start malfunctioning during the execution of some action instance and not as the result of it. In general, there may be quite a number of abnormalities that cannot be attributed to the malfunctioning of an action. So we define the more general notion of a qualification $\kappa$, consisting of triples $(o_j, d, t)$ each specifying an object $o_j$, the value $d$ of the object and the time point $t$ at which the object $o_j$ takes this value $d$.

In case of primary diagnosis, the qualification $\kappa$ is used to change the value (the health mode) of plan steps. Hence, the triples have the form $(a, d, depth(a))$ with $a \in A$ and $d \in D_a$. Note that the plan diagnosis defined in [Witteveen et al., 2005] is a special case of primary diagnosis where the qualification $\kappa$ consists of triples $(a, ab, depth(a))$ and where, for the general fault mode $ab$, the behavior of the action is unknown.


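[Figure 3 (diagram not reproduced): actions a1 to a8 operating on objects o1 to o5, with the partial states π0 to π3 at times t = 0 to t = 3 and an unpredictable result ⊥.]

Figure 3: Plan execution with abnormal actions.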

Using qualifications, we say that $(\pi', t + 1)$ is (directly) generated by execution of $P$ from $(\pi, t)$ given the qualification $\kappa$, abbreviated by $(\pi, t) \to_{\kappa;P} (\pi', t + 1)$, iff the following conditions hold for some auxiliary state $\pi''$:

1. $\pi''(o) = d$ for each $o \in O$ if $(o, d, t) \in \kappa$, else $\pi''(o) = \pi(o)$.

2. $(\pi'', t) \to_P (\pi', t + 1)$.

For arbitrary values of $t \leq t'$ we say that $(\pi', t')$ is (directly or indirectly) generated by execution of $P$ from $(\pi, t)$ given the qualification $\kappa$, denoted by $(\pi, t) \to^*_{\kappa;P} (\pi', t')$, iff the following conditions hold:

1. if $t = t'$ then $\pi' = \pi$;

2. if $t' = t + 1$ then $(\pi, t) \to_{\kappa;P} (\pi', t')$;

3. if $t' > t + 1$ then there must exist some state $(\pi'', t' - 1)$ such that $(\pi, t) \to^*_{\kappa;P} (\pi'', t' - 1)$ and $(\pi'', t' - 1) \to_{\kappa;P} (\pi', t')$.

Example Figure 3 gives an illustration of an execution of a plan. Suppose action $a_3$ is abnormal and generates a result that is unpredictable ($\bot$). Given the qualification $\kappa = \{(a_3, ab, 1)\}$ and the partially observed state $\pi_0$ at time point $t = 0$, we predict the partial states $\pi_i$ as indicated in Figure 3, where $(\pi_0, t_0) \to^*_{\kappa;P} (\pi_i, t_i)$ for $i = 1, 2, 3$. Note that since the value of $o_1$ and of $o_5$ cannot be predicted at time $t = 2$, the result of action $a_6$ and of action $a_8$ cannot be predicted and $\pi_3$ contains only the value of $o_3$.

Suppose now that we have a (partial) observation $obs(t) = (\pi, t)$ of the state of the world at time $t$ and an observation $obs(t') = (\pi', t')$ at time $t' > t \geq 0$ during the execution of the plan $P$. We would like to use these observations to infer the health states of the actions occurring in $P$. Assuming a normal execution of $P$, we can (partially) predict the state of the world at a time point $t'$ given the observation $obs(t)$: if all actions behave normally, we predict a partial state $\pi'_\emptyset$ at time $t'$ such that $(\pi \sqcup \pi_\delta, t) \to^*_P (\pi'_\emptyset, t')$. Since we do not require observations to be made systematically, $O(\pi')$ and $O(\pi'_\emptyset)$ might only partially overlap. Therefore, if this assumption holds, the values of the objects that occur in both the predicted state and the observed state at time $t'$ should match, i.e., we should have

$\pi' \approx \pi'_\emptyset$.

If this is not the case, the execution of some action instances must have gone wrong and we have to determine an action qualification $\kappa$ such that the predicted state derived using $\kappa$ agrees with $\pi'$. This is nothing else than a straightforward extension of the diagnosis concept in MBD [Reiter, 1987; Console and Torasso, 1991] to plan diagnosis:

Definition 1 Let $P = \langle A, < \rangle$ be a plan with observations $obs(t) = (\pi, t)$ and $obs(t') = (\pi', t')$, where $t < t' \leq depth(P)$, and let the action qualification $\kappa$ be a set of triples $(a, d, depth(a))$ with $a \in A$ and $d \in D_a$. Moreover, let $(\pi \sqcup \pi_\delta, t) \to^*_{\kappa;P} (\pi'_\kappa, t')$ be a derivation assuming an action qualification $\kappa$.

Then $\kappa$ is said to be a primary plan diagnosis (action diagnosis) of $\langle P, obs(t), obs(t') \rangle$ iff $\pi' \approx \pi'_\kappa$.

So in a primary plan diagnosis $\kappa$, the observed partial state $\pi'$ at time $t'$ and the predicted state $\pi'_\kappa$ at time $t'$ assuming the action qualification $\kappa$ agree upon the values of all objects $O(\pi') \cap O(\pi'_\kappa)$ occurring in both states.
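The check behind Definition 1 can be sketched as follows. This is a simplified illustration rather than the authors' algorithm: only the general fault mode `ab` is modelled, an abnormal action leaves its range unpredictable, and the plan encoding (`depth`, `dom`, `ran`, `f` per action) as well as the two-action example are hypothetical.

```python
def agree(observed, predicted):
    """pi' ~ pi'_kappa: equal values on every object defined in both states."""
    return all(predicted[o] == v for o, v in observed.items() if o in predicted)


def predict(state, plan, kappa, t, t_end):
    """Predict the partial state at t_end from `state` at t under kappa.

    plan:  {action: (depth, dom, ran, f)}, f maps a dict over dom to a dict over ran.
    kappa: set of (action, mode, time) triples; the time component equals
           depth(action) and is therefore not needed by this sketch.
    """
    abnormal = {a for a, mode, _ in kappa if mode == "ab"}
    for time in range(t, t_end):
        layer = [(a, spec) for a, spec in plan.items() if spec[0] == time]
        touched = {o for _, (_, _, ran, _) in layer for o in ran}
        nxt = {o: v for o, v in state.items() if o not in touched}
        for a, (_, dom, ran, f) in layer:
            # Abnormal or non-enabled actions leave their range unpredicted.
            if a not in abnormal and all(o in state for o in dom):
                nxt.update(f({o: state[o] for o in dom}))
        state = nxt
    return state


def is_primary_diagnosis(plan, obs_t, obs_t2, kappa):
    (pi, t), (pi2, t2) = obs_t, obs_t2
    return agree(pi2, predict(pi, plan, kappa, t, t2))


# Hypothetical plan: a1 doubles o1 at depth 0, a2 sets o2 := o1 + 1 at depth 1.
plan = {
    "a1": (0, ("o1",), ("o1",), lambda s: {"o1": 2 * s["o1"]}),
    "a2": (1, ("o1",), ("o2",), lambda s: {"o2": s["o1"] + 1}),
}
obs0, obs2 = ({"o1": 3}, 0), ({"o2": 9}, 2)
single_faults = [
    a for a in plan
    if is_primary_diagnosis(plan, obs0, obs2, {(a, "ab", plan[a][0])})
]
```

With the observation `o2 = 9` the empty qualification is rejected, because normal execution predicts `o2 = 7`, while qualifying either action as abnormal removes the conflicting prediction.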

Example Consider again Figure 3 and suppose that we did not know that action $a_3$ was abnormal and that we observed $obs(0) = ((d_1, d_2, d_3, d_4), 0)$ and $obs(3) = ((d'_1, d'_3, d'_5), 3)$. Using the normal plan derivation relation starting with $obs(0)$ we will predict a state $\pi'_\emptyset$ at time $t = 3$ where $\pi'_\emptyset = (d''_1, d''_2, d''_3, \bot_4, \bot_5)$. If everything is ok ($\kappa = \emptyset$), the values of the objects predicted as well as observed at time $t = 3$ should correspond, i.e., we should have $d'_j = d''_j$ for $j = 1, 3$. If, for example, only $d'_1$ would differ from $d''_1$, then we could qualify $a_6$ as abnormal, since then the predicted state at time $t = 3$ using $\kappa = \{(a_6, ab, 2)\}$ would be $\pi'_\kappa = (\bot_1, \bot_2, d''_3, \bot_4, \bot_5)$ and this partial state agrees with the observed state.

Note that for all objects in $O(\pi') \cap O(\pi'_\kappa)$, the qualification $\kappa$ provides an explanation for the observation $\pi'$ made at time point $t'$. Hence, for these objects the qualification provides an abductive diagnosis [Console and Torasso, 1990]. For all observed objects in $O(\pi') - O(\pi'_\kappa)$, no value can be predicted given the qualification $\kappa$. Hence, by declaring them to be unpredictable, the conflicts that would arise for these objects under a normal execution of all actions are resolved. This corresponds with the idea of a consistency-based diagnosis [Reiter, 1987].

Diagnosing a sequence of observations In the previous section we described how to diagnose the execution of a plan between two observations at different time points. Here, the observation at the earliest time point corresponds to the observed inputs of a system in classical Model-Based Diagnosis, while the observation at the latest time point corresponds to the observed outputs in classical Model-Based Diagnosis. During the execution of a plan, however, we may make observations at more than two time points. Unless we observe the complete state of the world at each of these time points, we cannot use successive pairs of observations to make the best possible diagnosis of the part of the plan executed between these time points. Hence, we must extend our definition of plan diagnosis to handle sequences of observations.

The use of a sequence of partial observations implies that a diagnosis of the part of a plan executed between time points $t_i$ and $t_{i+1}$ may lead to predictions for the unobserved objects at $t_{i+1}$ that are relevant for diagnosing the part of the plan executed between $t_{i+1}$ and $t_{i+2}$. Hence, a qualification of the actions executed between two time points $t_i$ and $t_{i+1}$ depends on the qualification of actions executed before $t_i$.

Definition 2 Let $P = \langle A, < \rangle$ be a plan with observations $obs(t_1) = (\pi_1, t_1), \ldots, obs(t_k) = (\pi_k, t_k)$, where $t_1 < t_2 < \ldots < t_k \leq depth(P)$. Moreover, let $\kappa$ be an action qualification.

The action qualification $\kappa$ is said to be a plan diagnosis of $\langle P, obs(t_1), \ldots, obs(t_k) \rangle$ iff

• $(\pi_1 \sqcup \pi_\delta, t_1) \to^*_{\kappa;P} (\pi'_2, t_2)$,

• $(\pi_i \sqcup \pi'_i, t_i) \to^*_{\kappa;P} (\pi'_{i+1}, t_{i+1})$ for $1 < i < k$, and

• $\pi_i \approx \pi'_i$ for $1 < i \leq k$.

4.2 Secondary plan diagnosis

Actions may fail because of unforeseen (environmental) conditions such as being struck by lightning, malfunctioning equipment or incorrect beliefs of agents. Diagnosing these secondary causes is more difficult since weather, equipment and agents may play a role in the execution of several actions. Moreover, objects such as equipment and weather may go through several unforeseen mode changes.

The qualification introduced above for primary diagnosis can also be used for secondary diagnosis. In fact, we did not use the default assumptions to model qualifications of actions in order to have a uniform representation for both failing actions and underlying causes. A secondary qualification $\kappa$ consists of triples $(o_j, d, t)$ where $o_j \in O - A$ is an object that changes to the value $d \in D_j$ at time point $t$. Usually we choose for the time point $t$ the depth $depth(a)$ of the first action instance where the change manifests itself. So, for some action $a$, $t = depth(a)$ and $o_j \in dom_O(a)$.

An object such as an airplane may have several (fault) modes. Between these modes transitions are possible. For example, continuing to drive an overheated engine will cause more severe damage, namely a completely ruined engine. Of course, not every transition between the (fault) modes is valid. For example, a truck with a broken engine cannot become a truck with only a flat tyre without first repairing the truck's engine. Hence, we need Discrete Event Systems [Cassandras and Lafortune, 1999] to represent equipment or objects such as the weather.

The specification of the discrete event system consists of the values $D_o$ of an object $o$, the events $(o, d, t) \in \kappa$ that change the value of the object $o$, and a transition function describing for object $o$ the set of valid transitions. Hence, we assume that for every object $o_j \in O$ a transition function $tr_j : D_j \to 2^{D_j}$ has been specified. This transition function describes how the value of an object may change due to events that are unknown to the agent. One of the goals of diagnosis is to determine some of these unknown events.

The values of some objects in the environment may only change due to the execution of actions. For these objects $o_j$, the transition function is the identity function, i.e., $tr_j(d) = d$ for every $d \in D_j$. The identity function disallows any change in the object's value that is not the result of an action.

Since the transition function places restrictions on the possible transitions of an object, we have to adapt the first item of the specification of $(\pi, t) \to_{\kappa;P} (\pi', t + 1)$.

1. $\pi''(j) = d$ if $(o_j, d, a) \in \kappa$, $d \in tr_j(\pi(j))$ and $t = depth(a)$, else $\pi''(j) = \pi(j)$.

2. $(\pi'', t) \to_P (\pi', t + 1)$.

Definition 3 Let $P = \langle A, < \rangle$ be a plan with observations $obs(t) = (\pi, t)$ and $obs(t') = (\pi', t')$, where $t < t' \leq depth(P)$, and let the qualification $\kappa$ be a set of triples $(o, d, t)$ with $o \in O - A$ and $d \in D_o$. Moreover, let $(\pi \sqcup \pi_\delta, t) \to^*_{\kappa;P} (\pi'_\kappa, t')$ be a derivation assuming a qualification $\kappa$ and the transition functions $tr_j : D_j \to 2^{D_j}$ for each object $o_j \in O$.

Then qualification $\kappa$ is said to be a secondary plan diagnosis of $\langle P, obs(t), obs(t') \rangle$ iff $\pi' \approx \pi'_\kappa$.
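The adapted first condition, in which an event only takes effect when the transition function permits it, can be sketched as below. The encoding is an assumption made for illustration: events carry their time point directly rather than an action whose depth gives the time, transition functions are nested dictionaries, and the engine modes are hypothetical.

```python
def apply_events(state, kappa, t, transitions):
    """Compute the auxiliary state pi'' from pi at time t.

    kappa:       set of (object, value, time) events.
    transitions: transitions[o][v] is the set of values that object o
                 may take next when its current value is v.
    An event is ignored unless its target value is a valid successor
    of the object's current value.
    """
    new_state = dict(state)
    for o, d, time in kappa:
        if time == t and o in state and d in transitions[o][state[o]]:
            new_state[o] = d
    return new_state


# Hypothetical engine: it can overheat, and an overheated engine can be ruined,
# but a normal engine cannot become ruined in a single event.
transitions = {
    "engine": {
        "nor": {"nor", "overheated"},
        "overheated": {"overheated", "ruined"},
        "ruined": {"ruined"},
    }
}
state = {"engine": "nor"}
valid = apply_events(state, {("engine", "overheated", 1)}, 1, transitions)
invalid = apply_events(state, {("engine", "ruined", 1)}, 1, transitions)
```

Feeding the resulting auxiliary state into the unqualified step relation then yields the qualified step.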

The secondary diagnosis can be divided into agent, equipment and environment diagnosis depending on whether the object $o$ in a triple $(o, d, t) \in \kappa$ belongs to $Ag$, $E$ or $N$ respectively.

An interesting special case of secondary diagnosis is agent diagnosis. Agents may incorrectly execute an action because of wrong internal beliefs about the agent's environment or about how actions should be executed. One possible cause of such wrong beliefs is incorrect observations from malfunctioning equipment such as sensors. In principle, an agent's incorrect beliefs can be modeled using the agent's state. Hence, we need an agent qualification $(o_j, d, t)$ with $o_j \in Ag$ describing the incorrect beliefs of an agent that have led to the incorrect execution of actions. This can especially be the case if an action must be achieved by choosing appropriate behaviors. Note that agent diagnosis is closely related to social diagnosis described by Kalech and Kaminka [Kalech and Kaminka, 2003; 2004].

4.3 Applications of diagnosis

As mentioned in the introduction, the information provided by primary and secondary diagnosis can be used to improve the way agents deal with plan failures.

First, to adjust the planning after plan failure, we need an analysis of the expected future execution of the plan and whether the goals will still be reached. Secondary plan diagnosis enables us to determine which future actions may also be affected by the malfunctioning agents and equipment, and by unforeseen state changes in the environment.

Definition 4 Let $t$ be the current time point and let $\kappa$ be a secondary diagnosis of the plan executed so far. Then the set of future actions that will be affected given the current diagnosis $\kappa$ is:

$\{a \in A \mid (o_j, d, t') \in \kappa,\ o_j \in dom_O(a),\ d \neq nor,\ depth(a) \geq t'\}$
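The set in Definition 4 translates almost literally into a set comprehension. In this sketch the plan encoding and the example objects are hypothetical; the current time point is not used because it does not appear in the set expression itself.

```python
def affected_future_actions(plan, kappa):
    """Actions whose domain contains an object that kappa puts in a fault mode.

    plan:  {action: (depth, dom)} with dom a set of object names.
    kappa: set of (object, value, time) triples.
    """
    return {
        a
        for a, (depth, dom) in plan.items()
        for o, d, t in kappa
        if o in dom and d != "nor" and depth >= t
    }


# Hypothetical plan and diagnosis: the plane breaks at time 2.
plan = {
    "load": (0, {"truck"}),
    "taxi": (1, {"plane"}),
    "fly": (2, {"plane", "pilot"}),
}
affected = affected_future_actions(plan, {("plane", "broken", 2)})
```

Here only `fly` is reported, since `taxi` uses the plane before the fault occurs and `load` does not use it at all.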


Besides identifying the actions that will be affected by agent and equipment failure or by unexpected changes in the environment, we can also determine the goals that can still be reached.

Definition 5 Let $t$ be the current time point, let $\pi$ be the current partial state and let $\kappa$ be a secondary diagnosis of the plan executed so far. Moreover, let $(\pi, t) \to^*_{\kappa;P} (\pi', depth(P))$. Then the set of goals that can still be realized is given by:

$\{g \in G \mid \pi' \models g\}$

Second, based on the equipment diagnosis, the agents can point out which equipment should be repaired. Moreover, we can view repairs as events that change an equipment object from a failed state into a normal state. Then, we can use Definitions 4 and 5 to verify the consequences of a certain repair. This way, agents can consider which repair to choose if repairs are limited (e.g., due to their costs).

Third, it is also important to know the agents responsible for the failures. This information can contribute to negotiation on repairs of plan failure, to division of costs of failed plans or of plan repair, and to avoiding failures of future plans.

As an illustration of different agents that can be responsible for a plan-execution failure, reconsider the example in the introduction, where the agent responsible for the belly landing can be the pilot agent, the maintenance agent, or the airline agent that reduced the maintenance budget.

Here we will present a very simple model of responsibility. We introduce a responsibility function $res : (O - N) \to Ag$ specifying the agent that is responsible for each of the action, agent and equipment objects.

Definition 6 Let $\kappa$ be any diagnosis of a plan execution and let $res : (O - N) \to Ag$ be a responsibility function.

Then for each event $(o, d, t) \in \kappa$, the responsible agent is determined by $res(o)$.

5 Conclusion

This paper describes a generalization of the model for plan diagnosis as presented in [Witteveen et al., 2005; Roos and Witteveen, 2005]. New in the current approach is (i) the introduction of primary and secondary diagnosis, and (ii) the introduction of objects representing actions, agents and equipment. The primary diagnosis identifies failed actions and possibly in which way they failed, while the secondary diagnosis addresses the causes for action failures. The latter is an improvement over the plan diagnosis presented in [Witteveen et al., 2005; Roos and Witteveen, 2005], where only dependencies between action failures could be described using causal rules. An additional feature of the approach proposed here is that all objects can be modeled as discrete event systems. This enables the description of the unknown dynamic behavior of objects such as equipment over time. The secondary diagnosis then identifies the events behind the state changes of these objects. The results of primary and secondary diagnosis can be used to predict future action failures, to determine the goals that can still be reached and to identify the agents that can be held responsible for plan-execution failures.

References

[Birnbaum et al., 1990] L. Birnbaum, G. Collins, M. Freed, and B. Krulwich. Model-based diagnosis of planning failures. In AAAI 90, pages 318–323, 1990.

[Carver and Lesser, 2003] N. Carver and V. R. Lesser. Domain monotonicity and the performance of local solutions strategies for cdps-based distributed sensor interpretation and distributed diagnosis. Autonomous Agents and Multi-Agent Systems, 6(1):35–76, 2003.

[Cassandras and Lafortune, 1999] C. G. Cassandras and S. Lafortune. Introduction to Discrete Event Systems. Kluwer Academic Publishers, 1999.

[Console and Torasso, 1990] L. Console and P. Torasso. Hypothetical reasoning in causal models. International Journal of Intelligent Systems, 5:83–124, 1990.

[Console and Torasso, 1991] L. Console and P. Torasso. A spectrum of logical definitions of model-based diagnosis. Computational Intelligence, 7:133–141, 1991.

[de Jonge and Roos, 2004] F. de Jonge and N. Roos. Plan-execution health repair in a multi-agent system. In PlanSIG 2004, 2004.

[de Jonge et al., 2005] F. de Jonge, N. Roos, and H. J. van den Herik. Keeping plan execution healthy. In Multi-Agent Systems and Applications IV: CEEMAS 2005, LNCS 3690, pages 377–387, 2005.

[Horling et al., 2001] B. Horling, B. Benyo, and V. Lesser. Using self-diagnosis to adapt organizational structures. In Proceedings of the 5th International Conference on Autonomous Agents, pages 529–536. ACM Press, 2001.

[Kalech and Kaminka, 2003] M. Kalech and G. A. Kaminka. On the design of social diagnosis algorithms for multi-agent teams. In IJCAI-03, pages 370–375, 2003.

[Kalech and Kaminka, 2004] M. Kalech and G. A. Kaminka. Diagnosing a team of agents: Scaling-up. In AAMAS 2004, 2004.

[Kleer and Williams, 1989] J. de Kleer and B. C. Williams. Diagnosing with behaviour modes. In IJCAI 89, pages 104–109, 1989.

[McCarthy, 1977] J. L. McCarthy. Epistemological problems of artificial intelligence. In IJCAI, pages 1038–1044, 1977.

[Reiter, 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32:57–95, 1987.

[Roos and Witteveen, 2005] N. Roos and C. Witteveen. Diagnosis of plans and agents. In Multi-Agent Systems and Applications IV: CEEMAS 2005, LNCS 3690, pages 357–366, 2005.

[Struss and Dressler, 1989] P. Struss and O. Dressler. "Physical negation": integrating fault models into the general diagnostic engine. In IJCAI, pages 1318–1323, 1989.

[Witteveen et al., 2005] C. Witteveen, N. Roos, R. van der Krogt, and M. de Weerdt. Diagnosis of single and multi-agent plans. In AAMAS 2005, pages 805–812, 2005.


Getting the Probabilities Right for Measurement Selection

Johan de Kleer

Palo Alto Research Center

3333 Coyote Hill Road, Palo Alto, CA 94304 USA

Abstract

The core objective of model-based diagnosis is to identify candidate diagnoses which explain the observed symptoms. Usually there are multiple such candidate diagnoses and a model-based diagnostic engine proposes additional measurements to better isolate the actual diagnosis. An objective of such an algorithm is to identify this diagnosis in minimum average expected cost (e.g., the sum of the costs of the measurements). Minimizing this cost requires having accurate probability estimates for the candidate diagnoses. Most diagnostic engines utilize sequential diagnosis combined with Bayes Rule to determine the posterior probability of a candidate diagnosis given a measurement outcome. Unfortunately, one of the terms of Bayes Rule, the conditional probability of a measurement outcome given a candidate diagnosis, must often be estimated (noted as ε in most formulations). This paper presents a reformulation of the sequential diagnosis process used in diagnostic engines and shows how different ε-policies lead to varying results.

1 Introduction

Model-based diagnosis has been applied to a wide range of applications including automobiles [Struss and Price, 2004], spacecraft [Williams and Nayak, 1996], mobile robots [Steinbauer and Wotawa, 2005] and software [Kob and Wotawa, 2004], to mention just a few. The core objective of model-based diagnosis is to identify candidate diagnoses which explain the observed symptoms. Usually there are multiple such diagnoses and a model-based diagnostic engine proposes additional measurements to better isolate the actual diagnosis. An objective of such an algorithm is to identify this diagnosis in minimum average expected cost (e.g., the sum of the costs of the measurements). Minimizing this cost requires having accurate probability estimates for the candidate diagnoses. Most diagnostic engines utilize a greedy sequential diagnosis combined with Bayes Rule to determine the posterior probability of a candidate diagnosis given a measurement outcome. Unfortunately, one of the terms of Bayes Rule, the conditional probability of a measurement outcome given a candidate diagnosis, must often be estimated (noted as ε in most formulations). This paper presents a reformulation of the sequential diagnosis process used in most diagnostic engines and shows the results of various ε-policies. In order to minimize possible confounding of different domain models and to have easy access to many examples, we draw all our examples from a widely available combinatorial logic test suite from ISCAS-85 [Brglez and Fujiwara, 1985].

In order to focus on the impact of varying ε-policies we make the following assumptions. (All the assumptions can be relaxed, but doing so would confound the results.): (1) All measurements have equal cost, (2) No intermittent faults, (3) No multi-step lookahead, (4) The inference engine used to derive the consequences of observations is complete, (5) All the system's inputs are known, (6) One symptomatic output is given, (7) Time is not modeled, (8) The system has at most two faults, (9) The behavioral model for each component is completely described, (10) The system is well-formed (no unattached inputs or outputs or cycles).

2 GDE Probability Framework

This basic framework is described in [de Kleer and Williams, 1987; de Kleer et al., 1992].

Definition 1 A system is a triple (SD, COMPS, OBS) where:

1. SD, the system description, is a set of first-order sentences.

2. COMPS, the system components, is a finite set of constants.

3. OBS, a set of observations, is a set of first-order sentences.

Definition 2 Given two sets of components $C_p$ and $C_n$, define $D(C_p, C_n)$ to be the conjunction:

$$\Bigl[\bigwedge_{c \in C_p} AB(c)\Bigr] \wedge \Bigl[\bigwedge_{c \in C_n} \neg AB(c)\Bigr],$$

where $AB(x)$ represents that the component $x$ is ABnormal (faulted).

A diagnosis is a sentence describing one possible state of the system, where this state is an assignment of the status normal or abnormal to each system component.


Definition 3 Let $\Delta \subseteq COMPS$. A diagnosis for (SD, COMPS, OBS) is $D(\Delta, COMPS - \Delta)$ such that the following is satisfiable:

$$SD \cup OBS \cup \{D(\Delta, COMPS - \Delta)\}$$

Components are assumed to fail independently. Therefore, the prior probability that a particular diagnosis $D(C_p, C_n)$ is correct is:

$$p(D) = \prod_{c \in C_p} p(c) \prod_{c \in C_n} (1 - p(c)), \qquad (1)$$

where $p(c)$ is the prior probability that component $c$ is faulted.

The posterior probability of a diagnosis $D$ after an observation that $x$ has value $v$ is given by Bayes Rule:

$$p(D \mid x = v) = \frac{p(x = v \mid D)\, p(D)}{p(x = v)}. \qquad (2)$$

$p(D)$ is determined by the preceding measurements and prior probabilities of failure. The denominator $p(x = v)$ is a normalizing term that is identical for all $p(D)$ and thus need not be computed directly. Thus the only term remaining to be evaluated in the equation is $p(x = v \mid D)$:

$p(x = v \mid D) = 1$ if $x = v$ follows from $D$, SD,

$p(x = v \mid D) = 0$ if $D$, SD, $(x = v)$ are inconsistent.

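If neither holds,

$$p(x = v \mid D) = \varepsilon_{ik}, \qquad (3)$$

so that the numerator of equation 2 becomes $\varepsilon_{ik}\, p(D)$, where $\varepsilon_{ik} = \frac{1}{m}$. This corresponds to the intuition that if $x$ ranges over $m$ possible values, then each possible value is equally likely. In digital circuits $m = 2$ and thus $\varepsilon = .5$.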
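Equations 1 and 2 together with the three likelihood cases can be sketched in Python as follows. The sketch uses ε itself as the likelihood of an outcome that a diagnosis does not predict, which is the form the entropy derivation in Section 3 relies on. Function names, the dictionary encoding and the two-component example are assumptions made for illustration.

```python
def prior(faulty, components, p_fault):
    """Equation 1: product of p(c) over faulty and 1 - p(c) over normal components."""
    result = 1.0
    for c in components:
        result *= p_fault[c] if c in faulty else 1.0 - p_fault[c]
    return result


def posterior(priors, predictions, value, epsilon):
    """Equation 2 with likelihood 1, 0 or epsilon.

    priors:      {diagnosis: p(D)}
    predictions: {diagnosis: predicted value of x, or None if none is predicted}
    """
    weights = {}
    for d, p in priors.items():
        pred = predictions[d]
        if pred is None:
            likelihood = epsilon      # neither entailed nor inconsistent
        elif pred == value:
            likelihood = 1.0          # x = v follows from D and SD
        else:
            likelihood = 0.0          # x = v is inconsistent with D and SD
        weights[d] = likelihood * p
    total = sum(weights.values())     # the normalizing term p(x = v)
    return {d: w / total for d, w in weights.items()}


# Hypothetical system with two components, each failing with probability 0.1.
components = ["A", "B"]
p_fault = {"A": 0.1, "B": 0.1}
priors = {
    "none": prior(set(), components, p_fault),
    "A": prior({"A"}, components, p_fault),
    "B": prior({"B"}, components, p_fault),
}
predictions = {"none": 1, "A": None, "B": 1}
post_one = posterior(priors, predictions, 1, 0.5)
post_zero = posterior(priors, predictions, 0, 0.5)
```

Observing the value 0 eliminates every diagnosis that predicted 1, so all remaining probability mass moves to the diagnosis that made no prediction.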

Consider other possible values for $\varepsilon_{ik}$. As ε approaches 0, some diagnoses would be assigned far smaller posterior probabilities, which would lead to inaccurate conclusions and excessive measurement cost. For example, multiple faults would be assigned far smaller probability than is actually the case. So long as ε > 0 the GDE algorithm will identify the correct diagnosis after sufficient measurements (ε = 0 would assign 0 probability to correct diagnoses). As ε approaches 1, there would be little need to use Bayes Rule and the relative likelihoods of any two diagnoses would always be a constant. This would force GDE to consider very unlikely candidate diagnoses. Looked at differently, varying ε from 0 to 1 approximates the spectrum of abductive-based to consistency-based diagnostic frameworks [Brusoni et al., 1998]. ε clearly must lie between 0 and 1, but should it be $\varepsilon = \frac{1}{m}$?

There are a number of reasons that a candidate diagnosis might fail to predict a value for a measured variable.

• Incompleteness in the inference engine used (e.g., GDE's).

• Incompleteness in the component models.

• The model may predict a disjunction of values (as can be the case in qualitative models).

• Lack of knowledge of the actual faulty behavior of a component.

Although the lack of inferential completeness is common in model-based diagnosis engines, in this paper we focus on the last source and use complete models and a complete inference procedure. Note that lack of knowledge of the faulty behavior of a component does not necessarily imply that a candidate diagnosis fails to predict some value. A minimal diagnosis $D(B, G)$ is one where there is no other diagnosis $D(B', G')$ where $B'$ is a proper subset of $B$. Minimal candidate diagnoses often predict every variable value. Under the assumptions of this paper, minimal candidates always assign a value to every variable. Consider a minimal diagnosis $D(B, G)$ where $b \in B$. As there are no fault models, there is no model to predict a value for the output(s) of $b$. As $D(B, G)$ is minimal we know that $D(B - \{b\}, G \cup \{b\})$ cannot be consistent; therefore the inputs/outputs observed or inferred around $b$ are inconsistent with the correct behavior of $b$. In the case of binary valued variables, this means that the output of $b$ must be the opposite of what $b$'s behavioral model predicts (i.e., if the variable can't be 0, it must be 1 and vice versa). Hence, failure to assign a variable value only occurs with multiple faults (or with multi-valued quantities such as $+, -, 0$).

3 Using an ε-policy

In order to avoid excessive computational cost, many diagnostic algorithms utilize a greedy minimum entropy approach to select the best measurement to make next (i.e., the one which, on average, minimizes the cost of identifying the correct diagnosis). [de Kleer and Williams, 1987] shows how expected entropy outcomes for hypothetical measurements can be determined without additional inference. The outcome can be calculated directly from the current probability distribution of measurement outcomes given $\varepsilon_{ik} = \frac{1}{m}$. We now generalize this approach to allow an arbitrary ε-policy. Fortunately, the outcomes can be evaluated directly in the general case as well. Given a set of diagnoses, DIAGNOSES, and assuming all measurements are of unit cost,

$$H = -\sum_{D \in DIAGNOSES} p(D) \log p(D), \qquad (4)$$

estimates the number of measurements needed to complete a diagnosis. We define

$$S_{ik} = \{D \in DIAGNOSES \mid D \cup SD \cup OBS \vdash x_i = v_{ik}\},$$

$$U_i = \{D \in DIAGNOSES \mid D \notin S_{ik} \text{ for any } k\},$$

$$p(S_{ik}) = \sum_{C_j \in S_{ik}} p_j,$$

$$p(U_i) = \sum_{C_j \in U_i} p_j,$$

$$p(x_i = v_{ik}) = p(S_{ik}) + \varepsilon_{ik}\, p(U_i).$$

$\varepsilon_{ik}$ is determined by the diagnostic policy, under the restriction that $\sum_{k=1}^{m} \varepsilon_{ik} = 1$ for all $i$. The expected entropy after measuring $x_i$ is:

$$H_e(x_i) = \sum_{k=1}^{m} p(x_i = v_{ik})\, H(x_i = v_{ik}). \qquad (5)$$


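Let $p'$ be the probability after making the measurement. Substituting equation 2 into equation 4 gives:

$$H(x_i = v_{ik}) = -\sum_{l \in S_{ik} \cup U_i} p'_l \log p'_l = -\sum_{l \in S_{ik}} \frac{p_l}{p(x_i = v_{ik})} \log \frac{p_l}{p(x_i = v_{ik})} - \sum_{l \in U_i} \frac{\varepsilon_{ik}\, p_l}{p(x_i = v_{ik})} \log \frac{\varepsilon_{ik}\, p_l}{p(x_i = v_{ik})}.$$

Substituting $H(x_i = v_{ik})$ into equation 5 gives:

$$H_e(x_i) = -\sum_{k=1}^{m} \sum_{l \in S_{ik}} p_l \log \frac{p_l}{p(x_i = v_{ik})} - \sum_{k=1}^{m} \sum_{l \in U_i} (\varepsilon_{ik}\, p_l) \log \frac{\varepsilon_{ik}\, p_l}{p(x_i = v_{ik})}.$$

Expanding the logarithms:

$$H_e(x_i) = -\sum_{k=1}^{m} \sum_{l \in S_{ik}} p_l \log p_l + \sum_{k=1}^{m} \sum_{l \in S_{ik}} p_l \log p(x_i = v_{ik}) - \sum_{k=1}^{m} \sum_{l \in U_i} \varepsilon_{ik}\, p_l \log p_l - \sum_{k=1}^{m} \sum_{l \in U_i} \varepsilon_{ik}\, p_l \log \varepsilon_{ik} + \sum_{k=1}^{m} \sum_{l \in U_i} \varepsilon_{ik}\, p_l \log p(x_i = v_{ik}).$$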
The first and third terms are simply the current entropy $H$, which is necessarily constant. The second and fifth terms are the negative entropy of the probability distribution of $x_i$. The expected entropy $H_e(x_i)$ to minimize has the following form:
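The grouping of the five terms of the expanded expression can be checked term by term. The following short derivation uses only two facts stated earlier: every diagnosis lies either in exactly one $S_{ik}$ or in $U_i$, and $\sum_{k} \varepsilon_{ik} = 1$.

```latex
\begin{align*}
% Terms 1 and 3 add up to the current entropy H.
-\sum_{k=1}^{m}\sum_{l \in S_{ik}} p_l \log p_l
  - \sum_{k=1}^{m}\sum_{l \in U_i} \varepsilon_{ik}\, p_l \log p_l
  &= -\sum_{l \notin U_i} p_l \log p_l - \sum_{l \in U_i} p_l \log p_l = H \\
% Terms 2 and 5 combine through p(x_i = v_ik) = p(S_ik) + eps_ik p(U_i).
\sum_{k=1}^{m} \Bigl[\, \sum_{l \in S_{ik}} p_l + \varepsilon_{ik} \sum_{l \in U_i} p_l \Bigr] \log p(x_i = v_{ik})
  &= \sum_{k=1}^{m} p(x_i = v_{ik}) \log p(x_i = v_{ik}) \\
% Term 4 factors into p(U_i) times the entropy-like sum over the policy.
-\sum_{k=1}^{m}\sum_{l \in U_i} \varepsilon_{ik}\, p_l \log \varepsilon_{ik}
  &= -\,p(U_i) \sum_{k=1}^{m} \varepsilon_{ik} \log \varepsilon_{ik}
\end{align*}
```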

$$H + \sum_{k=1}^{m} p(x_i = v_{ik}) \log p(x_i = v_{ik}) - p(U_i) \sum_{k=1}^{m} \varepsilon_{ik} \log \varepsilon_{ik}.$$

The best proposed measurement is the one which maximizes information gain:

$$-\sum_{k=1}^{m} p(x_i = v_{ik}) \log p(x_i = v_{ik}) + p(U_i) \sum_{k=1}^{m} \varepsilon_{ik} \log \varepsilon_{ik}.$$

Expected information gain is the expected reduction in the number of additional measurements needed to isolate the true diagnosis and always lies between 0 and 1. There is thus no need to utilize additional inferential machinery to hypothesize the results of possible measurement outcomes. Given a policy for distributing $p(U_i)$, the proposed measurement can be evaluated directly from known probabilities. Furthermore, if the ε-policy is fixed, then $\sum_{k=1}^{m} \varepsilon_{ik} \log \varepsilon_{ik}$ is constant throughout the diagnosis task.
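The information gain expression can be evaluated from the aggregated probabilities alone, as the following Python sketch illustrates. The function name and argument layout are hypothetical, and natural logarithms are assumed.

```python
import math


def information_gain(p_S, p_U, eps):
    """Expected information gain of measuring one variable x_i.

    p_S[k] = p(S_ik), p_U = p(U_i), eps[k] = epsilon_ik with sum(eps) == 1.
    Natural logarithms are used (an assumption of this sketch).
    """
    def xlogx(x):
        # Convention: 0 * log 0 = 0.
        return x * math.log(x) if x > 0 else 0.0

    p_x = [s + e * p_U for s, e in zip(p_S, eps)]   # p(x_i = v_ik)
    return -sum(xlogx(p) for p in p_x) + p_U * sum(xlogx(e) for e in eps)


# All diagnoses predict a value and split evenly: maximal gain for m = 2.
even_split = information_gain([0.5, 0.5], 0.0, [0.5, 0.5])
# No diagnosis predicts a value: the measurement carries no information.
no_prediction = information_gain([0.0, 0.0], 1.0, [0.5, 0.5])
# Every diagnosis predicts the same value: nothing to learn either.
certain = information_gain([1.0, 0.0], 0.0, [0.5, 0.5])
```

Ranking the candidate variables by this score and measuring the highest-scoring one gives the greedy selection step described in this section.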

Table 1: Expected costs (information gains) for cascaded inverters after measurements (with p = .01).

             a = 0                       a = 0, e = 0
      ε = .01   ε = .5   ε = 1     ε = .01   ε = .5   ε = 1
a       0         0        0         0         0        0
b      .05       .02       0        .69       .59       0
c      .10       .04       0        .70       .68       0
d      .13       .06       0        .69       .59       0
e      .16       .07       0         0         0        0

4 Advantages of the GDE FrameworkOne of the fundamental advantages of the GDE framework isthat it is unnecessary to enumerate all the possible fault modesbeforehand. Thus a diagnostic algorithm can successfully di-agnose a system having never-before-seen faults. These faultmodes are a challenge to more conventional diagnostic ap-proaches which require far more prior knowledge of all thesystem’s fault modes.

GDE's probabilistic framework allows it to identify the best measurement to make next to localize the system's fault. Consider the simple four inverter circuit of Figure 1 and $a = 0$.

Figure 1: Four sequential inverters.

To see the effects of different $\varepsilon_{ik}$'s, consider some simplistic policies. Table 1 lists expected costs of measuring all the variables, first after $a = 1$, and then after $e = 0$. The costs are given for three values of $\varepsilon_{ik}$. Note that $\varepsilon = 1$ is equivalent to using no probabilistic information at all, and as a consequence the resulting costs cannot be used to rank proposed measurements. As long as $0 < \varepsilon < 1$, GDE can eventually identify good measurements to make next.

5 Using a Fixed ε-policy

We have implemented a new diagnostic algorithm called εGDE which accepts an arbitrary ε-policy and is logically complete (it identifies all conflicts and all variable value predictions efficiently). It is provided an ε-policy, fault probabilities, a system model, and a set of input vectors and symptoms. Given this as input, εGDE computes the average expected cost to diagnose the system. There is one source of stochastic variability in εGDE: when it encounters multiple measurement choices of approximately equal costs (within 5%, to normalize for round-off errors), it chooses randomly among them.

Consider a fixed policy in which the $\varepsilon_{ik}$ are fixed for all $k$ and where $\sum_{k=1}^{m} \varepsilon_{ik} = 1$. The data in Figure 2 shows the average expected costs with $\varepsilon_{i0}$ increasing in $.05$ steps and $\varepsilon_{i1} = 1 - \varepsilon_{i0}$. All components fail with equal probability $p = 0.0001$. Circuit c432 has 160 components. The data was gathered with 248 randomly generated double faults, each with a fully populated input vector with one identified symptom. Figure 2 illustrates the diagnostic cost for each fixed ε-policy. For this device, diagnostic cost is minimum if $\varepsilon$ is nearly 1 for all 1 values, and nearly 0 for 0 values. This is very different from the $\varepsilon = .5$ estimate of GDE.

Figure 2: Average cost vs. ε for 1 for circuit c432 with .95 confidence interval. c432 has 160 gates and is a 27-channel interrupt controller from the ISCAS-85 test suite.

Figure 3: Average cost vs. ε for 1 for circuit c499. c499 has 202 gates and is a 32-bit single-error-correcting circuit from the ISCAS-85 test suite.

Figure 3 shows the results on circuit c499, which has 202 components. The diagnostic task is to isolate a double fault from among all possible double faults. For this task, there is a sharp notch around $.5$ which corresponds to GDE's estimate. Figure 4 shows the results on circuit c880, which has 383 components. This likewise has a notch at $.5$.

6 Towards a Dynamic ε-policy

Figures 2, 3 and 4 suggest adaptive ε-policies can improve diagnostic costs. We would like to devise a dynamic ε-policy appropriate to each diagnostic task. It is also important that this policy be easy to compute, otherwise it competes with the alternative of expensive multi-step lookahead. Consider the oversimplistic example of a single inverter (say $A$ in Figure 1). Assume that $p(\neg AB(A)) = \alpha$ and we measured $a = 0$ and $b = 1$. The priors are $p(AB(A)) = (1 - \alpha)$, $p(\neg AB(A)) = \alpha$. Given this evidence, $\varepsilon = 1/2$ gives a posterior of $p(\neg AB(A))$ to be $\frac{2\alpha}{\alpha + 1}$.

Definition 4 [Raiman et al., 1991] A component behaves non-intermittently if its outputs are a function of its inputs.

Figure 4: Average cost vs. ε for 1 for circuit c880. c880 has 383 gates and is an 8-bit arithmetic logic unit from the ISCAS-85 test suite.

Table 2: The 4 possible binary functions of one input and one output. i is the binary input, and each column fi lists the outputs for the corresponding input.

i   f0  f1  f2  f3
0   0   0   1   1
1   0   1   1   0

Table 2 describes all possible binary functions of one input and one output. $f_3$ describes the correct behavior of an inverter, $f_0$ is the fault "stuck-at-0," $f_2$ is the fault "stuck-at-1," and $f_1$ an unexpected short of input to output. Given that each fault mode $f_{i \neq 3}$ has equal probability ($\frac{\alpha}{3}$), the correct posterior should be $\frac{3\alpha}{2\alpha + 1}$. This corresponds to $\varepsilon = \frac{1}{3}$. The difference is:

\[
\frac{\alpha(\alpha - 1)}{(2\alpha + 1)(\alpha + 1)}.
\]
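The two posteriors and their difference can be checked with exact rational arithmetic. The sketch below assumes the Bayesian update implied by the text, in which the good model predicts the observation with probability 1 and the faulty model with probability ε; the function name and the sample value of α are illustrative assumptions.

```python
from fractions import Fraction

def posterior_good(alpha, eps):
    """Posterior p(not AB(A)) after an observation that the good model
    predicts with probability 1 and the faulty model with probability eps."""
    return alpha / (alpha + (1 - alpha) * eps)

alpha = Fraction(9, 10)  # hypothetical prior that the inverter is good
gde = posterior_good(alpha, Fraction(1, 2))    # GDE's eps = 1/2 estimate
exact = posterior_good(alpha, Fraction(1, 3))  # eps = 1/3 from Table 2
difference = gde - exact
```

Because `Fraction` is exact, the closed forms in the text can be compared with `==` rather than with a floating-point tolerance.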

Suppose that we had measured $b = 0$ instead. Table 2 shows that only one of the three possible faulty behaviors is eliminated: $f_2$ is eliminated, but $f_0$ and $f_1$ remain. Therefore, $p(x = v|D) = \frac{2}{3}$.

For simplicity consider only the first three inverters and the double fault diagnosis of $D(\{B, C\}, \{A\})$. The prior is $(1 - \alpha)\alpha^2$. After measuring $d = 0$, GDE reduces its probability by $\frac{1}{2}$. Given Table 2 we can compute the posterior probability exactly as follows. Inverter $C$ is faulted with output $0$ and thus it can only be behaving according to functions $f_0$ or $f_1$. The input to the faulty inverter $B$ is $1$, but that alone does not provide any evidence for changing its probability. For the $f_1$ mode of inverter $C$ to produce $d = 0$, $c$ must be $0$. This is inconsistent with modes $f_1$ or $f_2$ of $B$. Therefore, there are only 4 consistent combinations of modes for $B$ and $C$: $\langle f_0, f_0 \rangle$, $\langle f_0, f_1 \rangle$, $\langle f_1, f_0 \rangle$, $\langle f_2, f_0 \rangle$. As only 4 out of the 9 combinations survive, the posterior probability is reduced by $\frac{4}{9}$. Measuring $c = 0$ eliminates only $\langle f_2, f_0 \rangle$ to a final reduction of $\frac{3}{9}$. Tables 3 and 4 summarize these calculations.
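The count of surviving mode combinations after measuring $d = 0$ can be reproduced by brute-force enumeration over Table 2. This is only an illustrative sketch: the encoding of the functions as tuples and the helper name `consistent_modes` are assumptions, and it covers the $d = 0$ step only, with the input of $B$ fixed to 1 as stated in the text.

```python
from itertools import product

# One-input functions of Table 2: (output for input 0, output for input 1).
FUNCTIONS = {"f0": (0, 0), "f1": (0, 1), "f2": (1, 1), "f3": (1, 0)}
FAULTY = ["f0", "f1", "f2"]  # f3 is the correct inverter behavior

def consistent_modes(b_input, d_observed):
    """Return the (mode of B, mode of C) pairs that reproduce d_observed."""
    survivors = []
    for mode_b, mode_c in product(FAULTY, repeat=2):
        c = FUNCTIONS[mode_b][b_input]  # output of B is the input of C
        d = FUNCTIONS[mode_c][c]
        if d == d_observed:
            survivors.append((mode_b, mode_c))
    return survivors

survivors = consistent_modes(b_input=1, d_observed=0)
```

The ratio `len(survivors) / 9` then gives the factor by which the probability of the double-fault candidate is reduced.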

Consider a simple 2-input and gate. Table 5 lists all possible behaviors for a 2-input/1-output gate. The correct behavior for the and gate is given by $f_2$. All remaining behaviors correspond to fault modes.


Table 3: GDE vs. correct probability changes for D({B, C}, {A}).

OBS      ε = 1/2   correct
d = 0    1/2       4/9
c = 0    1/4       3/9

Table 4: GDE vs. correct probability changes for D({A, B, C}, {}).

OBS      ε = 1/2   correct
d = 0    1/2       2/3
c = 0    1/4       4/9
b = 0    1/8       8/27

Table 5: Possible functions of two inputs and one output.

i0  i1   f0  f1  f2  f3  f4  f5  f6  f7
0   0    0   0   0   0   0   0   0   0
0   1    0   0   0   0   1   1   1   1
1   1    0   0   1   1   0   0   1   1
1   0    0   1   0   1   0   1   0   1

i0  i1   f8  f9  f10 f11 f12 f13 f14 f15
0   0    1   1   1   1   1   1   1   1
0   1    0   0   0   0   1   1   1   1
1   1    0   0   1   1   0   0   1   1
1   0    0   1   0   1   0   1   0   1

This analysis and the results of Figures 2 and 3 suggest that an adaptive ε-policy might be exploited to further improve the diagnostic cost of isolating faulty components.

7 Extension to Fault Modes

The probabilistic framework outlined in Sections 2 and 3 can be directly expanded to include component fault modes [de Kleer and Williams, 1989]. The AB/¬AB framework can be viewed as assigning each component a "G", good, mode or a "U", unknown, faulty mode. The good model for an and gate is given by column $f_2$ of Table 5, and faulty behaviors correspond to all the remaining columns. Two common fault models for an and gate are "SA1" (output stuck at 1) and "SA0" (output stuck at 0). This model for the gate has 4 modes: "G" ($f_2$), "SA1" ($f_{15}$), "SA0" ($f_0$), and "U" (which corresponds to the remaining 13 columns). All the analyses of this paper directly extend to multiple fault modes, but $p(U)$ will always be smaller because some of the fault modes are explicitly modeled with their own probabilities. As the extension to fault modes is direct, we do not formalize them in this paper.
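The partition of the Table 5 columns into the four modes can be illustrated with a short enumeration. The sketch relies on an observation about the table layout (the outputs of column $f_k$, read down the rows, are the 4-bit binary expansion of $k$); the function names are illustrative assumptions.

```python
ROWS = [(0, 0), (0, 1), (1, 1), (1, 0)]  # row order used in Table 5

def table5_function(k):
    """Truth table of column f_k: the 4-bit binary expansion of k, read from
    the most significant bit, gives the outputs for ROWS in order."""
    return {row: (k >> (3 - j)) & 1 for j, row in enumerate(ROWS)}

def mode_of(k):
    """Classify column f_k as G (good and gate), SA1, SA0 or U (unknown)."""
    table = table5_function(k)
    if all(table[(a, b)] == (a & b) for a, b in ROWS):
        return "G"
    if all(v == 1 for v in table.values()):
        return "SA1"
    if all(v == 0 for v in table.values()):
        return "SA0"
    return "U"

modes = {k: mode_of(k) for k in range(16)}
unknown_columns = [k for k, m in modes.items() if m == "U"]
```

Adding a further explicit fault mode would simply move one more column out of `unknown_columns`, which is why $p(U)$ shrinks as more modes are modeled.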

8 Dynamic Epsilon Policies

One possible dynamic policy that has been suggested is a max-entropy policy where each $\varepsilon_{ik}$ is chosen to maximize the entropy of the value distribution of the $x_i$. The maximum entropy distribution for a two-valued quantity is trivially computable. The motivation for this policy is that it injects the least amount of information into the measurement scoring. This policy does not yield significant improvement in overall diagnostic cost.

Using the framework of the previous sections and two additional assumptions, we now define an ε-policy which computes $\varepsilon$ exactly for each individual variable and candidate diagnosis. Intuitively, we apply the diagnostic framework laid out in Section 2 recursively for each possible candidate. We assume that each component model specifies an output value when all its inputs are known. This assumption holds for digital circuits, but may not apply for some qualitative modeling paradigms where adding a qualitative "+" and "−" yields no result. In addition, we assume that we are provided the prior probabilities for each possible function of a component. For example, in the case of an and gate we are given the prior probability of each possible function $f_i$ of Table 5. Let $p(f_i(c))$ be the prior probability that component $c$ behaves according to function $f_i$. We know that:

\[
\sum_{f_i} p(f_i(c)) = 1,
\]

and,

\[
\sum_{f_i \in F(c)} p(f_i(c)) = p(c),
\]

where $F(c)$ is the set of all faulty functions $f_i$ and $p(c)$ is the prior probability that component $c$ is faulted as defined in Equation 1. Consider a diagnosis $D = D(B, G)$ which fails to predict some $x = v$. Restating Equation 3:

\[
p(x = v|D(B, G)) = \varepsilon_{ik}\, p(D(B, G)).
\]


Definition 5 A micro candidate diagnosis $M$ for candidate diagnosis $D = D(B, G)$ is a conjunction:

\[
\bigwedge_{c \in B} f'_i(c),
\]

where $f'_i$ is a formula describing the behavior of $f_i$ as a propositional formula, and $M$ is consistent with $D \cup SD \cup OBS$.

$p(M)$ follows straightforwardly from Bayes Rule:

\[
p(M) = \frac{\prod_{c \in B} p(f_i(c))}{\sum p(M)}.
\]

Thus,

\[
p(x = v|D) = \sum_{M \text{ s.t. } M \cup D \cup SD \cup OBS \vdash x = v} p(M)\, p(D),
\]

or,

\[
\varepsilon_{ik} = \sum_{M \text{ s.t. } M \cup D \cup SD \cup OBS \vdash x = v} p(M),
\]

where the $M$ are the micro candidate diagnoses for $D$.

If all the faulted $p(f_i(c))$ are equal for each component $c$, this new framework reduces to that of the previous section. This approach is most powerful when the individual $p(f_i(c))$ vary significantly. In these cases, the resulting diagnostic efficiency improvement can be significant.
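To make the definition of $\varepsilon_{ik}$ concrete, the following sketch computes it for the simplest case: a candidate whose only faulted component is one inverter, with no observation yet on its output, so every faulty function of Table 2 is a consistent micro candidate. The function name, the tuple encoding, and the skewed priors are illustrative assumptions, not values from the paper.

```python
from fractions import Fraction

# Faulty one-input functions of Table 2: (output for input 0, output for input 1).
FAULTY_FUNCTIONS = {"f0": (0, 0), "f1": (0, 1), "f2": (1, 1)}

def epsilon(priors, a, v):
    """epsilon for predicting output v of a single faulted inverter with input a.

    priors maps each faulty function name to its prior p(f_i(c)). Each micro
    candidate is one function choice; its probability is the prior normalized
    over all micro candidates, and epsilon sums those that entail the value v.
    """
    total = sum(priors.values())
    return sum(p / total
               for name, p in priors.items()
               if FAULTY_FUNCTIONS[name][a] == v)

equal = {name: Fraction(1, 3) for name in FAULTY_FUNCTIONS}
skewed = {"f0": Fraction(6, 10), "f1": Fraction(1, 10), "f2": Fraction(3, 10)}
```

With equal priors this reproduces the values $\frac{1}{3}$ and $\frac{2}{3}$ derived for the single inverter in Section 6, while skewed priors shift the weights accordingly.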

Using GDE with fault modes, identical results would be obtained if explicit fault modes were introduced for each possible faulty function of each component. Unfortunately, this approach is computationally intractable. For a digital circuit where the sum of the number of input terminals was $l$, the complexity would be $2^{2^l}$.

9 Conclusions

This paper presents two advances. First, it presents a generalization of the information gain equation used in evaluating possible measurements. Second, this paper presents an algorithm which improves the average expected costs of diagnosis by exploiting more precise estimates of the $\varepsilon$'s. Measures of improved diagnostic costs are found in the longer version of this paper.

10 Acknowledgments

Conversations with Olivier Raiman and Brian Williams helped clarify many of these concepts.

References

[Brglez and Fujiwara, 1985] F. Brglez and H. Fujiwara. A neutral netlist of 10 combinational benchmark circuits and a target translator in Fortran. In Proc. IEEE Int. Symposium on Circuits and Systems, pages 695–698, June 1985.

[Brusoni et al., 1998] Vittorio Brusoni, Luca Console, Paolo Terenziani, and Daniele Theseider Dupre. A spectrum of definitions for temporal model-based diagnosis. Artificial Intelligence, 102(1):39–79, 1998.

[de Kleer and Williams, 1987] J. de Kleer and B. C. Williams. Diagnosing multiple faults. Artificial Intelligence, 32(1):97–130, April 1987. Also in: Readings in NonMonotonic Reasoning, edited by Matthew L. Ginsberg, (Morgan Kaufmann, 1987), 280–297.

[de Kleer and Williams, 1989] J. de Kleer and B. C. Williams. Diagnosis with behavioral modes. In Proc. 11th IJCAI, pages 1324–1330, Detroit, 1989.

[de Kleer et al., 1992] J. de Kleer, A. Mackworth, and R. Reiter. Characterizing diagnoses and systems. Artificial Intelligence, 56(2-3):197–222, 1992.

[Kob and Wotawa, 2004] Daniel Kob and Franz Wotawa. Introducing alias information into model-based debugging. In 16th European Conference on Artificial Intelligence (ECAI), Valencia, Spain, August 2004.

[Raiman et al., 1991] O. Raiman, J. de Kleer, V. Saraswat, and M. H. Shirley. Characterizing non-intermittent faults. In Proc. 9th National Conf. on Artificial Intelligence, pages 849–854, Anaheim, CA, July 1991.

[Steinbauer and Wotawa, 2005] Gerald Steinbauer and Franz Wotawa. Detecting and locating faults in the control software of autonomous mobile robots. In Proceedings of the 19th International Joint Conference on AI (IJCAI-05), pages 1742–1743, Edinburgh, UK, 2005.

[Struss and Price, 2004] Peter Struss and Chris Price. Model-based systems in the automotive industry. AI Magazine, 24(4):17–34, 2004.

[Williams and Nayak, 1996] B. C. Williams and P. P. Nayak. A model-based approach to reactive self-configuring systems. In Proc. 14th National Conf. on Artificial Intelligence, pages 971–978, 1996.


Incremental Indexing of Temporal Observations in Diagnosis of Active Systems

Gianfranco Lamperti and Marina Zanella

Dipartimento di Elettronica per l’Automazione

Via Branze 38, 25123 Brescia, Italy

Tel: +390303715596 Fax: +39030380014


Abstract

Observations play a major role in diagnosis of discrete-event systems (DESs). At a high level of abstraction, as in the active system approach, this task takes as input the observable events generated by a DES and their emission order. However, uncertainty conditions, affecting the transmission from the DES to the observer and/or the capabilities of the observer itself, may obscure the (discrete) values of these events and/or their reciprocal order. Thus, in the general case, an uncertain observation can be represented as a directed acyclic graph. For efficiency, diagnostic processing requires generating a surrogate of such a graph, the index space. The scenario becomes more complicated when the observation is perceived as a list of fragments rather than in one shot, because the set of candidate diagnoses is supposed to be generated at the reception of each fragment. This translates to the need for computing a new index space every time. Since the computation from scratch is expensive, an incremental technique is proposed that is capable of extending the previous index space for producing the new one at the occurrence of each observation fragment.

1 Introduction

Observations are the inputs to several tasks that can be carried out by exploiting model-based reasoning techniques [Brusoni et al., 1998; Baroni et al., 1999; Roze and Cordier, 2002; Wotawa, 2002; Cordier and Pencolé, 2005; Lamperti and Zanella, 2006]. Temporal observations, being inherent to dynamical systems and processes, are endowed not only with a logical content, describing what has been emitted by the system, but also with a temporal content, describing when it has been emitted. Both (independent) aspects can be modeled either quantitatively or qualitatively. A general model for (qualitative uncertain) temporal observations was proposed in [Lamperti and Zanella, 2002], and exploited for describing the input of an a posteriori diagnosis task. Such a model consists of a directed acyclic graph where each node contains an uncertain logical content, ranging over a set of qualitative values (labels), and each edge is a temporal precedence relationship, entailing a partial emission order. The observation graph implicitly represents all the possible sequences of labels consistent with the temporal observation received over a time interval, where each sequence is a sentence of a language. In the same contribution it is remarked that, although the observation graph is intuitive and easy to build from the point of view of the observer, for the sake of efficiency of any further processing it is better to represent a language in the standard way regular languages are represented [Hopcroft and Ullman, 1979], that is, by means of a deterministic automaton. This automaton is called index space and it is built as the transformation of a nondeterministic automaton drawn from the observation graph. The problem with this construction method arises when the nodes of the observation graph are received and processed one at a time, typically in monitoring-based diagnosis of dynamical systems. The need for producing appropriate diagnostic information at each occurring piece of observation [Lamperti and Zanella, 2004] translates to the need for generating a new index space at each new reception. However, a naive approach, which each time makes up the new index space from scratch, would be computationally inadequate. Therefore, this paper proposes a method for the incremental generation of the index space. The new algorithm is expected to benefit not only the active systems approach [Lamperti and Zanella, 2003], within which the notion of an index space was first proposed, but also any other approach dealing with discrete uncertain observations whose observable fragments are received and processed one at a time.

2 Temporal Observations

A history $h$ of a system is a sequence of state transitions, $h = \langle T_1, \ldots, T_n \rangle$, that produces a temporal sequence $\langle \ell_1, \ldots, \ell_k \rangle$, where each $\ell_i$, $i \in [1..k]$, $k \le n$, is the observable label generated by a visible transition in $h$.

The DES evolution described by $h$ is perceived outside the system as a temporal observation

\[
O = \langle \varphi_1, \ldots, \varphi_n \rangle
\]

which is a sequence of temporal fragments, totally ordered according to the order in which such fragments were received by the observer. $O$ brings some information about the temporal sequence generated by $h$. However, such information is uncertain due to synchronization errors affecting the multiplicity of communication channels between the (possibly distributed) system and the observer, and to noise on such channels.

Formally, let $\Lambda$ be a domain of observable labels, including the null label $\varepsilon$ (invisible to the observer). A temporal fragment $\varphi$ is a pair $(\lambda, \tau)$, $\lambda$ being the logical content, $\lambda \subseteq \Lambda$, $\lambda \neq \emptyset$, $\lambda \neq \{\varepsilon\}$, and $\tau$ the temporal content. The logical content represents what has been observed while the temporal content identifies the set of fragments preceding the current one in the emission order. We assume that the current fragment can be preceded in the emission order only by fragments that have already been received. Therefore, the temporal content of a fragment $\varphi_i$ is a (possibly empty) subset of the fragments preceding $\varphi_i$ in $O$ (i.e. the fragments that were received before $\varphi_i$), that is

\[
\forall i \in [1..n],\ \varphi_i = (\lambda_i, \tau_i)\ \ (\tau_i \subseteq \{\varphi_1, \ldots, \varphi_{i-1}\}).
\]

A fragment is uncertain in nature, both logically and temporally. Logical uncertainty means that $\lambda$ includes the actual (possibly null) label generated by a system transition, but further spurious labels may be involved too. Temporal uncertainty means that only partial emission ordering is known among fragments.

A sub-observation $O_{[i]}$ of $O$, $i \in [0..n]$, is the (possibly empty) prefix of $O$ up to the $i$-th fragment, $O_{[i]} = \langle \varphi_1, \ldots, \varphi_i \rangle$.

Example 1. Let $\Lambda = \{short, open, \varepsilon\}$, $O = \langle \varphi_1, \varphi_2, \varphi_3, \varphi_4 \rangle$, where $\varphi_1 = (\{short, \varepsilon\}, \emptyset)$, $\varphi_2 = (\{open, \varepsilon\}, \{\varphi_1\})$, $\varphi_3 = (\{short, open\}, \{\varphi_2\})$, $\varphi_4 = (\{open\}, \{\varphi_1\})$. $\varphi_1$ is logically uncertain (either short or nothing has been emitted by the system). $\varphi_2$ follows $\varphi_1$ in the emission order and is logically uncertain (open vs. nothing). $\varphi_3$ follows $\varphi_2$ and is logically uncertain (short vs. open). $\varphi_4$ follows $\varphi_1$ and is logically certain (open). No temporal relationship is defined between $\varphi_4$ and $\varphi_2$ or $\varphi_3$.

Based on $\Lambda$, a temporal observation $O = \langle \varphi_1, \ldots, \varphi_n \rangle$ can be represented by a DAG, called an observation graph, $\gamma(O) = (\Lambda, \Omega, E)$, where $\Omega = \{\omega_1, \ldots, \omega_n\}$ is the set of nodes isomorphic to the fragments in $O$, each node being marked by a nonempty subset of $\Lambda$, and $E$ is the set of edges isomorphic to the temporal content of fragments in $O$. An emission-order precedence relationship is defined between nodes of the graph, specifically, $\omega \prec \omega'$ means that $\gamma(O)$ includes a path from $\omega$ to $\omega'$, while $\omega \preceq \omega'$ means either $\omega \prec \omega'$ or $\omega = \omega'$. We assume that the temporal content of each fragment is minimal, which translates to the canonicity of the observation graph. Specifically, the following condition holds:

\[
\forall (\omega_j \mapsto \omega_i \in E)\ \ (\nexists (\omega_k \mapsto \omega_i \in E),\ \omega_k \prec \omega_j).
\]

Example 2. The observation graph $\gamma(O)$, relevant to the observation defined in Example 1, is shown in Fig. 1. Note how $\gamma(O)$ implicitly contains several candidate temporal sequences, each generated by picking up a label from each node of the graph without violating the partial emission-order relationships among nodes. Possible candidates are, among others, $\langle short, open, short, open \rangle$, $\langle short, open, open \rangle$, and $\langle short, open \rangle$.1 However, we do not know which of the candidates is the actual temporal sequence generated by the system, the other ones being the spurious candidate sequences.

1The length of a candidate temporal sequence may be shorter

Figure 1: Observation graph $\gamma(O)$.

Consequently, from the observer (and, therefore, from the diagnosis) viewpoint, all candidate sequences share the same ontological status.
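The way an observation graph implicitly encodes its candidate temporal sequences can be illustrated by brute-force enumeration on the graph of Example 1. This is only a sketch for small graphs: the integer node encoding, the use of the empty string for the null label, and the function name are assumptions made here for illustration.

```python
from itertools import permutations, product

EPS = ""  # stands for the null label ε
LABELS = {1: {"short", EPS}, 2: {"open", EPS}, 3: {"short", "open"}, 4: {"open"}}
EDGES = {(1, 2), (2, 3), (1, 4)}  # (j, i) means node j precedes node i

def candidate_sequences(labels, edges):
    """All label sequences consistent with the observation graph."""
    nodes = sorted(labels)
    result = set()
    for order in permutations(nodes):
        position = {n: k for k, n in enumerate(order)}
        if any(position[j] > position[i] for j, i in edges):
            continue  # this ordering violates the partial emission order
        for choice in product(*(sorted(labels[n]) for n in order)):
            # The null label is 'transparent': drop it from the sequence.
            result.add(tuple(l for l in choice if l != EPS))
    return result

candidates = candidate_sequences(LABELS, EDGES)
```

Enumerating all node orderings is exponential, which is one motivation for replacing the graph with an automaton-based surrogate.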

3 Indexing Temporal Observations

Both for computational and space reasons, the observation graph is inconvenient for carrying out a task that takes as input a temporal observation. This claim applies to linear observations as well, each of which is merely a sequence $O$ of observable labels. In this case, it is more appropriate to represent each sub-observation $O'$ of $O$ as an integer index $i$ corresponding to the length of $O'$. As such, $i$ is a surrogate of $O'$. An analogous approach was proposed for graph-based temporal observations in [Lamperti and Zanella, 2000], where the notion of an index was extended so as to perform model-based reasoning on a surrogate of the temporal observation, called an index space.

Let $\gamma(O) = (\Lambda, \Omega, E)$. A prefix $P$ of $O$ is a (possibly empty) subset of $\Omega$ where

\[
\forall \omega \in P\ \ (\nexists\, \omega' \in P\ (\omega' \prec \omega)).
\]

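The formal definition of an index space is supported by two functions on $P$. The set of consumed nodes up to $P$ is

\[
Cons(P) = \{ \omega \mid \omega \in \Omega,\ \omega' \in P,\ \omega \preceq \omega' \}.
\]

The frontier of $P$ is

\[
Front(P) = \{ \omega \mid \omega \in (\Omega - Cons(P)),\ \forall (\omega' \mapsto \omega) \in E\ (\omega' \in Cons(P)) \}.
\]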

Example 3. Considering $\gamma(O)$ in Fig. 1, with $P = \{\omega_2, \omega_4\}$, we have $Cons(P) = \{\omega_1, \omega_2, \omega_4\}$ and $Front(P) = \{\omega_3\}$.
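The two functions can be sketched in a few lines and checked against Example 3. The representation of nodes as integers and of edges as pairs is an assumption made for illustration; the helper `ancestors` is a hypothetical name for the strict precedence relation.

```python
EDGES = {(1, 2), (2, 3), (1, 4)}  # Example 1: (j, i) means an edge from node j to node i
NODES = {1, 2, 3, 4}

def ancestors(node, edges):
    """Nodes that strictly precede `node` in the observation graph."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for j, i in edges:
            if i == current and j not in found:
                found.add(j)
                stack.append(j)
    return found

def cons(prefix, edges):
    """Consumed nodes: the prefix together with everything preceding it."""
    consumed = set(prefix)
    for node in prefix:
        consumed |= ancestors(node, edges)
    return consumed

def front(prefix, nodes, edges):
    """Frontier: unconsumed nodes whose parents have all been consumed."""
    consumed = cons(prefix, edges)
    return {n for n in nodes - consumed
            if all(j in consumed for j, i in edges if i == n)}
```

For the empty prefix the frontier is the set of nodes without parents, which is where the construction of the prefix space starts.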

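The prefix space of a temporal observation $O$ is the nondeterministic automaton

\[
Psp(O) = (\mathbb{S}^n, \mathbb{L}^n, \mathbb{T}^n, S^n_0, \mathbb{S}^n_f)
\]

where

\[
\mathbb{S}^n = \{ P \mid P \text{ is a prefix of } O \}
\]

is the set of states,

\[
\mathbb{L}^n = \{ \ell \mid \ell \in \lambda,\ (\lambda, \tau) \in \Omega \}
\]

is the set of labels,

\[
S^n_0 = \emptyset
\]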

than the number of nodes in the observation graph owing to the immateriality of the null label $\varepsilon$, which is 'transparent'. For instance, candidate $\langle \varepsilon, \varepsilon, short, open \rangle$ is in fact $\langle short, open \rangle$.


Figure 2: Prefix space $Psp(O)$ and index space $Isp(O)$.

is the initial state,

\[
\mathbb{S}^n_f = \{ P \mid P \in \mathbb{S}^n,\ Cons(P) = \Omega \}
\]

is the set of final states, and $\mathbb{T}^n : \mathbb{S}^n \times \mathbb{L}^n \mapsto 2^{\mathbb{S}^n}$ is the transition function such that

\[
P \xrightarrow{\ell} P' \in \mathbb{T}^n
\]

iff, defining the '$\oplus$' operation as

\[
P \oplus \omega = (P \cup \{\omega\}) - \{ \omega' \mid \omega' \in P,\ \omega' \prec \omega \}, \qquad (1)
\]

we have $\omega \in Front(P)$, $\omega = (\lambda, \tau)$, $\ell \in \lambda$, $P' = P \oplus \omega$.

The index space of $O$ is the deterministic automaton $Isp(O)$ equivalent to $Psp(O)$. Each state in $Isp(O)$ is an index of $O$. Each path from the initial state of $Isp(O)$ to a final state is a mode in which we may choose a label in each node of the observation graph $\gamma(O)$ based on the partial ordering imposed by $\gamma(O)$ [Lamperti and Zanella, 2002], that is, each path in the index space is a candidate temporal sequence and, $Isp(O)$ being deterministic, there is only one path for each candidate sequence.

Example 4. Consider $\gamma(O)$ in Fig. 1. Shown in Fig. 2 are the prefix space $Psp(O)$ (left) and the index space $Isp(O)$ (shaded). Each prefix is written as a string of digits, e.g. 24 stands for $P = \{\omega_2, \omega_4\}$. Final states are double circled. According to the standard algorithm that transforms a nondeterministic automaton to a deterministic one [Hopcroft and Ullman, 1979], each node of $Isp(O)$ is identified by a subset of the nodes of $Psp(O)$. Nodes in $Isp(O)$ have been named $\Im_0 \cdots \Im_7$. These are the indexes of $O$.
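The from-scratch construction described above (prefix space first, then determinization by the standard subset construction) can be sketched as follows for the observation of Example 1. The data encoding and all function names are assumptions made for illustration; this is the naive baseline, not the incremental algorithm proposed in this paper.

```python
EPS = "ε"
LABELS = {1: {"short", EPS}, 2: {"open", EPS}, 3: {"short", "open"}, 4: {"open"}}
EDGES = {(1, 2), (2, 3), (1, 4)}
NODES = set(LABELS)

def ancestors(n):
    parents = {j for j, i in EDGES if i == n}
    return parents.union(*(ancestors(p) for p in parents))

def cons(prefix):
    return set(prefix).union(*(ancestors(n) for n in prefix))

def front(prefix):
    done = cons(prefix)
    return {n for n in NODES - done if all(j in done for j, i in EDGES if i == n)}

def oplus(prefix, n):
    # Eq. (1): add the node, drop the prefix nodes that precede it.
    return frozenset((set(prefix) | {n}) - ancestors(n))

def prefix_space():
    """Nondeterministic automaton as {prefix: [(label, next_prefix), ...]}."""
    start = frozenset()
    trans, todo = {}, [start]
    while todo:
        p = todo.pop()
        if p in trans:
            continue
        trans[p] = [(l, oplus(p, n)) for n in front(p) for l in LABELS[n]]
        todo.extend(q for _, q in trans[p])
    return start, trans

def closure(prefixes, trans):
    """ε-closure of a set of prefixes."""
    seen, todo = set(prefixes), list(prefixes)
    while todo:
        for l, q in trans[todo.pop()]:
            if l == EPS and q not in seen:
                seen.add(q)
                todo.append(q)
    return frozenset(seen)

def index_space():
    """Subset construction: {node: {label: next_node}} plus the initial node."""
    start, trans = prefix_space()
    s0 = closure({start}, trans)
    dfa, todo = {}, [s0]
    while todo:
        s = todo.pop()
        if s in dfa:
            continue
        moves = {}
        for p in s:
            for l, q in trans[p]:
                if l != EPS:
                    moves.setdefault(l, set()).add(q)
        dfa[s] = {l: closure(qs, trans) for l, qs in moves.items()}
        todo.extend(dfa[s].values())
    return s0, dfa
```

On this example the sketch yields seven prefixes and eight index-space nodes, in line with the nodes $\Im_0 \cdots \Im_7$ mentioned in Example 4.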

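As for observations, we may define a restriction of the index space up to the $i$-th fragment as follows. Let $Isp(O) = (\mathbb{S}, \mathbb{L}, \mathbb{T}, S_0, \mathbb{S}_f)$ be an index space, where $\gamma(O) = (\Lambda, \Omega, E)$, $\Omega = \{\omega_1, \ldots, \omega_n\}$. Let $S$ be a node in $\mathbb{S}$. The sub-node $S_{[i]}$ of $S$, $i \in [0..n]$, is

\[
S_{[i]} =
\begin{cases}
\emptyset & \text{if } i = 0 \\
\{ \Im \mid \Im \in S,\ \forall \omega_j \in \Im\ (j \le i) \} & \text{otherwise.}
\end{cases}
\qquad (2)
\]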
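The restriction of Eq. (2) amounts to a simple filter. In the sketch below a node is a set of prefixes and each prefix is a set of integer node indexes; this encoding and the function name are assumptions made for illustration. The sample node is $\Im_5$, whose five indexes (2, 3, 4, 24, 34) are those discussed in Example 5.

```python
def sub_node(node, i):
    """Restriction S[i] of an index-space node S (Eq. 2): keep only the
    prefixes whose observation-graph nodes all have index at most i."""
    if i == 0:
        return frozenset()
    return frozenset(p for p in node if all(j <= i for j in p))

# Node with the five indexes 2, 3, 4, 24 and 34 (integer encoding assumed here).
node5 = frozenset(map(frozenset, [{2}, {3}, {4}, {2, 4}, {3, 4}]))
restricted = sub_node(node5, 3)
```

A node whose prefixes all mention a fragment later than $i$ is restricted to the empty set, and such nodes are excluded from the sub-index space.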

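The sub-index space $Isp_{[i]}$ of $O$, $i \in [0..n]$, is an automaton

\[
Isp_{[i]}(O) = (\mathbb{S}', \mathbb{L}', \mathbb{T}', S_0, \mathbb{S}'_f)
\]

where

\[
\begin{aligned}
\mathbb{S}' &= \{ S' \mid S \in \mathbb{S},\ S' = S_{[i]},\ S' \neq \emptyset \} \\
\mathbb{T}' &= \{ T' \mid T \in \mathbb{T},\ T = S_1 \xrightarrow{\ell} S_2,\ T' = S'_1 \xrightarrow{\ell} S'_2,\ S'_1 = S_{1[i]},\ S'_1 \neq \emptyset,\ S'_2 = S_{2[i]},\ S'_2 \neq \emptyset \} \\
\mathbb{L}' &= \{ \ell \mid S'_1 \xrightarrow{\ell} S'_2 \in \mathbb{T}' \} \\
\mathbb{S}'_f &= \{ S' \mid S' \in \mathbb{S}',\ \Im \in S',\ Cons(\Im) = \{\omega_1, \ldots, \omega_i\} \}
\end{aligned}
\]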

The formal relationship between sub-observations and sub-index spaces is stated by Theorem 1.

Theorem 1. The sub-index space of an observation equals the index space of the sub-observation,

\[
Isp_{[i]}(O) = Isp(O_{[i]}). \qquad (3)
\]

Proof (sketch). The proof is supported by three lemmas. Lemma 1.1 is grounded on the definition of $Psp(O)$ and, particularly, on Eq. (1). Lemma 1.2 derives from the definition of sub-index space. Lemma 1.3 is based on the subset construction algorithm [Aho et al., 1986], which transforms a nondeterministic automaton $A^n$ into a deterministic one. $Clos(N^n)$ denotes the ε-closure of node $N^n$ in $A^n$. This is the set made up of $N^n$ and all the nodes that are reachable from $N^n$ via ε-transitions in $A^n$.

Figure 3: Genesis of $Isp_{[i]}(O)$ and $Isp(O_{[i]})$.


Figure 4: $Psp(O_{[3]})$, $Isp(O_{[3]})$, and $Isp_{[3]}(O)$.

Lemma 1.1. Let $P \xrightarrow{\ell} P'$ be a transition in $Psp(O)$. Let $Max(\Im)$ denote the most recent fragment of $P$ in $O$, namely $Max(P) = i \mid \omega_i \in P,\ \forall \omega_j \in \Im\ (j \le i)$. Then, $Max(\Im') \ge Max(\Im)$.

Lemma 1.2. Let $\Im_i \xrightarrow{\ell} \Im'_i$ be a transition in $Isp_{[i]}(O)$. Then, $\Im \xrightarrow{\ell} \Im'$ is a transition in $Isp(O)$, where $\Im_i = \Im_{[i]}$ and $\Im'_i = \Im'_{[i]}$.

Lemma 1.3. Let $\Im \xrightarrow{\ell} \Im'$ be a transition in $Isp(O)$. Then,

\[
\forall P' \in \Im'\ \ (P' \in Clos(P''),\ P'' \in \Im',\ P \xrightarrow{\ell} P'' \in Psp(O),\ P \in \Im). \qquad (4)
\]

Theorem 1 can be proven by induction on the nodes of the two automata in Eq. (3). The basis states the equality of the initial states. Let $P_0$, $P'_0$, $\Im_0$, and $\Im'_0$ be the initial states of $Psp(O)$, $Psp(O_{[i]})$, $Isp_{[i]}(O)$, and $Isp(O_{[i]})$, respectively. We have:

\[
\Im_0 = \{ P \mid P \in Clos(P_0),\ \forall \omega_j \in P\ (j \le i) \} \qquad (5)
\]

\[
\Im'_0 = \{ P \mid P \in Clos(P'_0) \}. \qquad (6)
\]

Since $P_0 = P'_0 = \emptyset$, based on Lemma 1.1, the subset of the nodes within $Clos(P_0)$ in Eq. (5) is in fact $Clos(P'_0)$, thereby making $\Im_0 = \Im'_0$.

The induction step is guided by Fig. 3, which shows how $Isp_{[i]}(O)$ and $Isp(O_{[i]})$ are generated starting from $O$.

Assume a transition $\Im_i \xrightarrow{\ell} \Im'_i \in Isp_{[i]}(O)$, where $\Im_i$ is also a node in $Isp(O_{[i]})$. We have to show that the same transition is in $Isp(O_{[i]})$ too. Consider a $P' \in \Im'_i$. According to Lemma 1.2, $\Im \xrightarrow{\ell} \Im' \in Isp(O)$, where $\Im_i = \Im_{[i]}$ and $\Im'_i = \Im'_{[i]}$. Based on Eq. (2), $P' \in \Im'$. According to Lemma 1.3, $P'$ is reachable in $Psp(O)$ from a prefix $P \in \Im$ via a path whose first transition is marked by $\ell$, all the subsequent transitions being marked by $\varepsilon$. Lemma 1.1 assures that all the prefixes involved in such a path (the root included) are composed of $\omega_j$ such that $j \le i$. This means that the same path is also in $Psp(O_{[i]})$, as it corresponds to choosing labels from nodes in $O$ that are also in $O_{[i]}$. Since, by assumption, $\Im_i$ is also a node in $Isp(O_{[i]})$, this means that the latter will include a transition $\Im_i \xrightarrow{\ell} \Im''_i$, where $P' \in \Im''_i$. We have to show that $\Im''_i = \Im'_i$. To this end, assume a $P' \in \Im''_i$. Owing to Lemma 1.3, $P'$ is reachable in $Psp(O_{[i]})$ from a prefix $P \in \Im_i$ via a path whose first transition is marked by $\ell$, all the subsequent transitions being marked by $\varepsilon$. $O$ being a monotonic extension of $O_{[i]}$, the same path will also be in $Psp(O)$. Since $\Im_i$ is also a node in $Isp_{[i]}(O)$, and the latter is deterministic, the target node $\Im'_i$ in transition $\Im_i \xrightarrow{\ell} \Im'_i$ will include $P'$. Thus, $\Im''_i = \Im'_i$, that is, $\Im_i \xrightarrow{\ell} \Im'_i$ is also in $Isp(O_{[i]})$. In a similar way it is possible to prove that, assuming a transition $\Im_i \xrightarrow{\ell} \Im'_i \in Isp(O_{[i]})$, where $\Im_i$ is also a node in $Isp_{[i]}(O)$, the same transition is in $Isp_{[i]}(O)$ too (the proof is left to the reader). This completes the induction step, which indicates the equality of the transition functions. To complete the proof of the theorem, we need to show the equality of the sets of final states. Let $\mathbb{S}$ and $\mathbb{S}'$ be the sets of states of $Isp_{[i]}(O)$ and $Isp(O_{[i]})$, respectively, and $\mathbb{S}_f$ and $\mathbb{S}'_f$ the corresponding sets of final states. According to the definition of sub-index space, we have

\[
\mathbb{S}_f = \{ \Im \mid \Im \in \mathbb{S},\ P \in \Im,\ Cons(P) = \{\omega_1, \ldots, \omega_i\} \}.
\]

Based on the subset construction algorithm, $\mathbb{S}_f$ is in fact the set of final states of $Isp(O_{[i]})$ too, that is, $\mathbb{S}_f = \mathbb{S}'_f$. This concludes the proof of Theorem 1.

Example 5. Consider the observation $O$ displayed in Fig. 1 and the relevant index space in Fig. 2. We show that, in compliance with Theorem 1, $Isp_{[3]}(O) = Isp(O_{[3]})$. To this end, shown on the left-hand side of Fig. 4 is the prefix space $Psp(O_{[3]})$, while the relevant index space $Isp(O_{[3]})$ is depicted in the center. On the right-hand side of the figure is a transformation of the index space $Isp(O)$ outlined in Fig. 2. Specifically, each node $S$ in $Isp(O)$ has been transformed into the sub-node $S_{[3]}$ by removing some (possibly all) of the indexes, as established by Eq. (2). For instance, in node $\Im_5$, three out of five indexes have been dropped, namely 34, 24, and 4 (which stand for $\{\omega_3, \omega_4\}$, $\{\omega_2, \omega_4\}$, and $\{\omega_4\}$, respectively), thereby producing the sub-node marked by 2 and 3. Note how the sub-node of $\Im_6$ becomes empty after the removal of (the only) index 34. Based on the definition of sub-index space, empty nodes are not part of the result. This is why $\Im_6$ and all entering edges are in dotted lines. A further peculiarity is the occurrence of duplicated sub-nodes, as for example $\{\Im_3, \Im_4, \Im_7\}$ and $\{\Im_1, \Im_5\}$. Each set of replicated nodes forms an equivalence class of sub-nodes which results in fact in a single node in the sub-index space. Thus, $\{\Im_3, \Im_4, \Im_7\}$ and $\{\Im_1, \Im_5\}$ are collapsed into the nodes marked by 3 and by 2, 3, respectively. This aggregation causes edges entering and/or exiting nodes in each equivalence class to be redirected to the corresponding sub-node in the result. Performing such arrangements on the graph and removing the dotted part, we obtain in fact the same graph depicted in the center of Fig. 4, namely $Isp(O_{[3]})$.

Corollary 1.1. Let $O = \langle \varphi_1, \ldots, \varphi_n \rangle$ be a temporal observation. Then, $\forall i \in [0..n]$, $\forall k \in [0..i]$,

\[
Isp_{[i-k]}(O_{[i]}) = Isp(O_{[i-k]}).
\]

4 Incremental Indexing

In case we need to compute the index space of each sub-observation of $O = \langle \varphi_1, \ldots, \varphi_n \rangle$, namely $Isp(O_{[i]})$, $i \in [1..n]$, it is prohibitive to calculate each new index space from scratch at the occurrence of each fragment $\varphi_i$, as this implies the construction of the nondeterministic $Psp(O_{[i]})$ and its transformation into the deterministic $Isp(O_{[i]})$. A better approach is generating the new index space incrementally, based on the previous index space and the new observation fragment, avoiding the generation and transformation of the nondeterministic automaton.

This is performed by algorithm Increment, generating the new observation graph $\gamma(O_{[i]})$ and relevant index space $Isp(O_{[i]})$, based on the previous $\gamma(O_{[i-1]})$ and $Isp(O_{[i-1]})$, and the new fragment $\varphi_i$, as specified in Fig. 6.

Corollary 1.1 provides the formal basis for stating that $Isp(O_{[i-1]})$ is a good starting point for building $Isp(O_{[i]})$. In fact, for $k = 1$, the corollary becomes $Isp_{[i-1]}(O_{[i]}) = Isp(O_{[i-1]})$, which means that there exists an operation (the sub-indexing) for obtaining $Isp(O_{[i-1]})$ given $Isp(O_{[i]})$. What we are looking for is the inverse operation. Our claim (which is not formally proven in the present paper) is that the inverse operation exists and that the Increment algorithm performs it.

In so doing, Increment is supported by a data structure, the bud set, and a piece of knowledge, the rule set, denoted $\mathcal{B}$ and $\mathcal{R}$, respectively.

Each bud in $\mathcal{B}$ is a triple $(N, P, \omega)$, where $N$ is a node of the index space, $P$ a prefix in $N$, and $\omega$ a node of the observation graph belonging to the frontier of $P$. A bud indicates

Figure 5: Effect of merging $N$ and $N'$.

1.  Increment(γ(O[i−1]), Isp(O[i−1]), ϕi) ⇒ (γ(O[i]), Isp(O[i]))
2.  begin
3.    Generate γ(O[i]) by means of the new fragment ϕi = (λi, τi);
4.    Initialize Isp(O[i]) as a copy of Isp(O[i−1]);
5.    ℬ := {(N, P, ωi) | N ∈ Isp(O[i]), P ∈ N, ωi ∈ Front(P)};
6.    loop
7.      Pick up a bud B = (N, P, ω), ω = (λ, τ), from the bud set ℬ;
8.      P′ := P ⊕ ω;
9.      for each ℓ ∈ λ do
10.       Extend Isp(O[i]) based on the rule set ℛ defined in Table 1
11.     end for;
12.     Remove bud B from ℬ
13.   while ℬ ≠ ∅;
14.   Yield the final states of Isp(O[i])
15. end.

Figure 6: Increment algorithm.

that, owing to the new observation fragment, $N$ needs further processing. This means, for instance, that all the candidate sequences of labels up to $N$ are followed by a label belonging to the logical content of the new fragment. Therefore, $N$ has to be extended by new edges leading to old nodes and/or by new edges leading to new nodes. Once processed, the bud is removed from $\mathcal{B}$. However, processing a bud possibly causes the generation of new buds since, for instance, the candidate sequences of labels up to a newly created node $N'$, ending with a label of the new fragment, can be followed by labels inherent to fragments received before it. Therefore, $N'$ has to be extended too.

Each rule Ri in R, i ∈ [1 .. 8], is a condition-action association (Table 1). The conditions are mutually exclusive. They involve the current topology of the index space, the bud B = (N, P, ω) picked up at the beginning of the loop body (line 7), the new prefix P′ (computed at line 8), and the label ℓ ∈ λ (line 9), where ω = (λ, τ). If no condition holds, then no operation is performed. For instance, the action of R1 merges nodes N and N′, as shown in Fig. 5. To do so, all edges entering/leaving N are redirected to/from N′, while N is removed. After the merging, the bud set must be updated. The action of R8, instead, redirects the edge N −ℓ→ N′ towards the new node N′ ∪ P′, as shown in Fig. 7, and duplicates the edges leaving N′. This operation too requires updating the bud set.

When the loop terminates, the new index space Isp(O[i]) is topologically complete. Only the final states must be yielded (line 14): these are the nodes that contain a prefix P such that Front(P) = ∅.
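The control structure of Fig. 6 (lines 6 to 13) is a worklist loop over the bud set. The following Python sketch illustrates only that skeleton and is not the authors' implementation. The names `process_buds`, `labels_of`, and `apply_rules` are hypothetical: `apply_rules` stands in for the rule set R of Table 1 (including the computation of P′ at line 8) and is assumed to return any newly generated buds.

```python
from collections import deque


def process_buds(initial_buds, labels_of, apply_rules):
    """Worklist skeleton of the bud-processing loop (assumed interface).

    initial_buds: iterable of (node, prefix, omega) triples.
    labels_of:    maps an observation node omega to its set of labels.
    apply_rules:  hypothetical callback standing in for rule set R; it
                  updates the index space and returns newly created buds.
    Returns the number of buds processed.
    """
    buds = deque(initial_buds)
    processed = 0
    while buds:
        bud = buds.popleft()               # pick up a bud (line 7)
        _node, _prefix, omega = bud
        for label in labels_of(omega):     # inner loop over labels (line 9)
            for new_bud in apply_rules(bud, label):
                if new_bud not in buds:    # avoid queuing the same bud twice
                    buds.append(new_bud)
        processed += 1                     # the bud is now removed (line 12)
    return processed
```

Unlike the loop ... while construct of Fig. 6, this sketch tests the bud set before the first iteration, so it also tolerates an empty initial bud set.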

Figure 7: Effect of redirecting N −ℓ→ N′.

Figure 8: Tracing of the incremental computation of Isp(O[4]).

Table 1: Rule set R: each rule Ri, i ∈ [1 .. 8], is a condition-action association guiding the execution of the Increment algorithm.

Rule | Condition | Action
R1 | ℓ = ε, N′ = N ∪ P′ exists already, N′ ≠ N | N and N′ are merged; B is updated.
R2 | ℓ = ε, N′ = N ∪ P′ does not exist | N is extended with P′; B is updated.
R3 | ℓ ≠ ε, no edge leaving N marked by ℓ, N′ = P′ already exists | A new edge N −ℓ→ P′ is created.
R4 | ℓ ≠ ε, no edge leaving N marked by ℓ, N′ = P′ does not exist | N′ = P′ and N −ℓ→ N′ are created; B is updated.
R5 | ℓ ≠ ε, there exists N −ℓ→ N′, no other edge entering N′, N″ = N′ ∪ P′ already exists, N″ ≠ N′ | N′ and N″ are merged; B is updated.
R6 | ℓ ≠ ε, there exists N −ℓ→ N′, no other edge entering N′, N″ = N′ ∪ P′ does not exist | P′ is inserted into N′; B is updated.
R7 | ℓ ≠ ε, there exists N −ℓ→ N′, there exists another edge entering N′, N″ = N′ ∪ P′ already exists, N″ ≠ N′ | N −ℓ→ N′ is substituted by N −ℓ→ N″.
R8 | ℓ ≠ ε, there exists N −ℓ→ N′, there exists another edge entering N′, N″ = N′ ∪ P′ does not exist | N −ℓ→ N′ is redirected towards N″; B is updated.

Example 6. Suppose that the sub-observation O[3] of observation O of Example 1 has already been received by the observer, one fragment at a time, and that Increment has correctly generated γ(O[3]) and Isp(O[3]). Now the fourth and last fragment of O, ϕ4, is received and Increment has to generate Isp(O[4]).

Shaded on the top-left of Fig. 8 is Isp(O[4]) at the beginning of the loop (line 6), which equals Isp(O[3]), depicted in the center of Fig. 4, with some extra information inherent to B drawn by processing ϕ4.

Specifically, each bud (N, P, ωi) ∈ B is represented by Pᵢ in node N. For example, bud (N2, {ω3}, ω4) is written in N2 as 3₄. The subsequent graphs in Fig. 8 depict the computational state of Isp(O[4]) at each new iteration of the loop. According to the initial (shaded) graph, at first, B includes eight buds.

The bud B chosen at each iteration (line 7) is shaded in the corresponding pictorial representation. The loop is iterated fourteen times:

(1) The bud picked up at the first iteration is (N3, {ω3}, ω4). At line 8, P′ = {ω3} ⊕ ω4 = {ω3, ω4}. Since λ(ω4) = {open}, the inner loop at line 9 is iterated only once, for ℓ = open. This corresponds to rule R4 in Table 1: the new node N4 is created and linked from N3 by an edge marked by open, as shown in graph Step 1 (no new bud is created).

(2) B = (N2, {ω3}, ω4), λ = {open}, and P′ = {ω3, ω4}. This corresponds to rule R8: node N5 is generated (no new bud is created).

(3) B = (N1, {ω3}, ω4), λ = {open}, P′ = {ω3, ω4}, rule R8: node N6 is generated; moreover, a new bud (N6, {ω2}, ω4) is inserted into B.

(4) B = (N2, {ω2}, ω4), λ = {open}, P′ = {ω2, ω4}, rule R8: node N7 is generated; moreover, a new bud (N7, {ω2, ω4}, ω3) is created.

(5) B = (N7, {ω2, ω4}, ω3), λ = {short, open}, and P′ = {ω3, ω4}. For ℓ = short, this corresponds to rule R3: edge N7 −open→ N4 is created. For ℓ = open, no operation (no condition is met).

(6) B = (N6, {ω2}, ω4), λ = {open}, P′ = {ω2, ω4}, rule R5: nodes N5 and N7 are merged.

(7) B = (N1, {ω2}, ω4), λ = {open}, P′ = {ω2, ω4}, rule R6: node N6 is extended with index P′, and a new bud (N6, {ω2, ω4}, ω3) is created.

(8) B = (N6, {ω2, ω4}, ω3), λ = {short, open}, P′ = {ω3, ω4}. For ℓ = short, rule R8: node N8 is generated. For ℓ = open, no operation.

(9) B = (N0, {ω2}, ω4), λ = {open}, P′ = {ω2, ω4}, and rule R6: node N2 is extended with index P′, and a new bud (N2, {ω2, ω4}, ω3) is created.

(10) B = (N2, {ω2, ω4}, ω3), λ = {short, open}, and P′ = {ω3, ω4}. For ℓ = short, rule R7: edge N2 −short→ N3 is redirected toward N8. For ℓ = open, no operation.

(11) B = (N1, {ω1}, ω4), λ = {open}, P′ = {ω4}, rule R6: node N6 is extended with P′ and bud (N6, {ω4}, ω2) is created.

(12) B = (N6, {ω4}, ω2), λ = {open, ε}, P′ = {ω2, ω4}. For ℓ = open, no operation. For ℓ = ε, no operation.

(13) B = (N0, {ω1}, ω4), λ = {open}, P′ = {ω4}, rule R6: node N2 is extended with index P′, and a new bud (N2, {ω4}, ω2) is created.

(14) B = (N2, {ω4}, ω2), λ = {open, ε}, P′ = {ω2, ω4}. For ℓ = open, no operation. For ℓ = ε, no operation.

Since B is empty, the loop terminates. The final states of Isp(O[4]) are N4, N6, N7, and N8. Note how the last (shaded) graph in Fig. 8 represents the same automaton as Isp(O) in Fig. 2.

5 EXPERIMENTS

The Increment algorithm was first coded in Prolog, and experiments based on this prototype were run to test the soundness and completeness of the algorithm before formally proving such properties. Further experiments on a subsequent implementation in C have shown that the algorithm also achieves the goal of efficiency, which is the reason it was proposed. The diagram in Fig. 9 represents the time (in seconds) to compute the index space of an uncertain temporal observation composed of (up to) 600 fragments. The upper curve refers to the computation of each index space from scratch. The lower curve corresponds to the incremental computation on the same platform.

Figure 9: Experimental results: index-space computation time (y-axis) vs. number of observation fragments (x-axis).

6 CONCLUSION

Both the observation graph and the index space are modeling primitives for temporal observations. Whereas the former, which is a DAG, is the front-end representation, suitable for modeling an observation while it is being received over a time interval, the latter, which is a deterministic automaton, is a back-end representation, suitable for model-based problem-solving. In fact, if the notion of an index space were not adopted for problem-solving, it would be necessary to compute all the sentences of the language defined by the observation and then to perform model-based reasoning on all of them. Moreover, the notion of an index space brings the advantage of adopting for observations the same formalism traditionally exploited for component models of DESs, be they synchronous or asynchronous. Actually, in the literature each reasoning step performed on the behavior of DESs translates to the composition of two or more automata. Now that the awareness has grown that DES observations can be represented as automata themselves (see, for instance, [Grastien et al., 2005]), the (only) operation that is needed for carrying out several model-based tasks is the synchronization between automata, where observations are handled exactly the same way as the other models. Finally, the index space, since it adheres to the standard formal representation of regular languages, could be adopted as an interchange format for uncertain observations among distinct application contexts.

This paper has presented an algorithm for constructing the index space incrementally, while receiving observation fragments one at a time. The tests performed so far have shown that the proposed technique brings a significant reduction of the computation time whenever a (nonmonotonic) processing step has to be performed after each observation fragment is received, as is the case when the tasks of supervision and dynamic diagnosis (and state estimation, in general) are considered. It is likely that other approaches to model-based reasoning on DESs can take advantage of this result, since the algorithm proposed in this paper relies on the model of a DES observation, which is, to a large extent, independent of the model adopted for the DES itself. A computational analysis of the algorithm, and its comparison with the experimental results, is still to be carried out.

References

[Aho et al., 1986] A. Aho, R. Sethi, and J.D. Ullman. Compilers – Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986.

[Baroni et al., 1999] P. Baroni, G. Lamperti, P. Pogliano, and M. Zanella. Diagnosis of large active systems. Artificial Intelligence, 110(1):135–183, 1999.

[Brusoni et al., 1998] V. Brusoni, L. Console, P. Terenziani, and D. Theseider Dupré. A spectrum of definitions for temporal model-based diagnosis. Artificial Intelligence, 102(1):39–80, 1998.

[Cordier and Pencolé, 2005] M.O. Cordier and Y. Pencolé. A formal framework for the decentralized diagnosis of large scale discrete event systems and its application to telecommunication networks. Artificial Intelligence, 164:121–170, 2005.

[Grastien et al., 2005] A. Grastien, M.O. Cordier, and C. Largouet. Incremental diagnosis of discrete-event systems. In Sixteenth International Workshop on Principles of Diagnosis – DX'05, pages 119–124, Monterey, CA, 2005.

[Hopcroft and Ullman, 1979] J.E. Hopcroft and J.D. Ullman. Introduction to Automata Theory. Addison-Wesley, Reading, MA, 1979.

[Lamperti and Zanella, 2000] G. Lamperti and M. Zanella. Uncertain temporal observations in diagnosis. In Fourteenth European Conference on Artificial Intelligence – ECAI'2000, pages 151–155, Berlin, D, 2000.

[Lamperti and Zanella, 2002] G. Lamperti and M. Zanella. Diagnosis of discrete-event systems from uncertain temporal observations. Artificial Intelligence, 137(1–2):91–163, 2002.

[Lamperti and Zanella, 2003] G. Lamperti and M. Zanella. Diagnosis of Active Systems – Principles and Techniques, volume 741 of The Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publisher, Dordrecht, NL, 2003.

[Lamperti and Zanella, 2004] G. Lamperti and M. Zanella. A bridged diagnostic method for the monitoring of polymorphic discrete-event systems. IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics, 34(5):2222–2244, 2004.

[Lamperti and Zanella, 2006] G. Lamperti and M. Zanella. Flexible diagnosis of discrete-event systems by similarity-based reasoning techniques. Artificial Intelligence, 170(3):232–297, 2006.

[Roze and Cordier, 2002] L. Roze and M.O. Cordier. Diagnosing discrete-event systems: extending the 'diagnoser approach' to deal with telecommunication networks. Journal of Discrete Event Dynamic Systems: Theory and Application, 12:43–81, 2002.

[Wotawa, 2002] F. Wotawa. On the relationship between model-based debugging and program slicing. Artificial Intelligence, 135(1–2):125–143, 2002.


Introducing Data Reduction Techniques into Reason Maintenance

Rudiger Lunde
University of Applied Sciences Ulm
Prittwitzstrasse 10, 89075 Ulm (Germany)
email: [email protected]

Abstract

Every problem which can be solved with a reason maintenance system can – in theory at least – be solved as well without it, using a simple generate and test algorithm. The main purpose of such a system is therefore to increase the efficiency of a reasoning system. In this paper, we analyze the impact of problem characteristics on the performance. For problems which include reasoning about physical models at a quantitative level, the complete dependency network maintained by reason maintenance systems is identified as a major resource consumer. Based on that analysis, a new component called 'value manager' is presented, which applies data reduction techniques to limit dependency management costs and is especially designed to support iterative solvers. As two examples of practical applications, experimental results from an automated FMEA generation run for an automotive system and a diagnosis of a fly-by-wire system are discussed.

1 Introduction

A reason maintenance system (RMS) is a book-keeping tool supporting a problem solver. It tracks dependencies between given and derived data during the inference process and contributes to the search for solutions in two ways. Firstly, it analyzes the causes of failures and thus helps to avoid useless search in subspaces without solutions. Secondly, it caches inferences and prevents the problem solver from redrawing the same inferences again and again.

An RMS views data as propositional symbols and relationships between the data as propositional clauses. This view is independent of the actual meaning of the data for the problem solver, which can be first or higher order, for example. Given sets of assumptions, data, and clauses, the RMS derives the belief status of every datum based on the current belief status of the assumptions. An RMS works incrementally. After the belief in some of the assumptions changes or a new clause is added, the belief status of all affected data is updated without starting from scratch. A contradiction is discovered if the belief status of a special datum denoting falsity changes from disbelieved to believed. Contradictions are immediately signaled to the problem solver, and information about the responsible assumptions is attached to the signal.
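The propositional view described above can be illustrated with a small Python sketch. It is a deliberately simplified, non-incremental stand-in (it recomputes everything on each call, whereas an RMS updates incrementally), and it records only one supporting set of assumptions per datum. The names `propagate` and `FALSITY` and the clause encoding `(antecedents, consequent)` are assumptions made for this illustration.

```python
FALSITY = "FALSE"  # special datum denoting falsity (name is an assumption)


def propagate(believed_assumptions, clauses):
    """Forward-chain over Horn clauses given as (antecedents, consequent).

    Returns a dict mapping every believed datum to one set of assumptions
    it was derived from. A contradiction has been found if FALSITY is a key.
    """
    support = {a: frozenset([a]) for a in believed_assumptions}
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in clauses:
            if consequent in support:
                continue  # already believed; keep the first support found
            if all(a in support for a in antecedents):
                # The datum depends on everything its antecedents depend on.
                support[consequent] = frozenset().union(
                    *(support[a] for a in antecedents))
                changed = True
    return support


clauses = [(("A1", "A2"), "p"), (("p", "A3"), FALSITY)]
result = propagate({"A1", "A2", "A3"}, clauses)
responsible = result.get(FALSITY)  # assumptions attached to the contradiction
```

In this usage example, `responsible` holds the assumptions A1, A2, and A3, which is the kind of information an RMS attaches to a contradiction signal.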

The use of reason maintenance systems has a long tradition in AI. Starting with the introduction of the non-monotonic JTMS [Doyle, 1979], a wide range of different reason maintenance systems has been developed (e.g. monotonic JTMS, ATMS, LTMS). They differ in the accepted types of clauses, their support w.r.t. contradiction resolution, and the way the belief status is represented in labels associated with data. The assumption-based truth maintenance system (ATMS) [de Kleer, 1986a] [de Kleer, 1986b] [de Kleer, 1986c] as part of the general diagnostic engine (GDE) [de Kleer and Williams, 1987] has attracted much attention in the field of model-based reasoning. It accepts Horn clauses and uses a special indexing scheme to store the sets of assumptions a datum ultimately depends on. It supports parallel search in different contexts and is especially suited for problems with many solutions, if all or several of them are of interest. This is usually the case in explanatory problems, where we want to know all or at least the most plausible causes for observed or assumed effects.

It is well known that the worst-case complexity of an ATMS is exponential in the number of assumptions. Reasoning systems utilizing an ATMS spend a considerable amount of time on label updates. Several extensions have been developed to improve the efficiency of the reasoning system as a whole. Forbus and de Kleer [Forbus and de Kleer, 1988] proposed a consumer control mechanism to focus inference selection on interesting contexts. Additionally, Dressler and Farquhar [Dressler and Farquhar, 1991] showed how the label update can be restricted to interesting contexts. Although label completeness is lost, completeness with respect to the current focus can still be guaranteed. Lazy label evaluation [Kelleher and van der Gaag, 1993] delays the label update until the problem solver shows interest in it. By combining focusing and lazy label evaluation, some synergy effects can be obtained [Tatar, 1994].

Reason maintenance systems have proven to be efficient tools to support reasoning about qualitative abstractions of real systems. However, when choosing lower levels of abstraction, two negative impacts on the efficiency of the reasoning system can be observed:

• The dependency management costs grow dramatically due to the increasing absolute number of assumptions and inferences needed for behavior prediction.

• The benefit achieved by the services of a reason maintenance system decreases because of the decreasing degree of similarity between different contexts and the gap between the true meaning of the derived data and the reason maintenance system's propositional view of it.

Both observations give reason to look for more efficient alternatives.

2 Reason Maintenance Systems seen from the Perspective of Machine Learning

From the point of view of machine learning, an RMS plays the role of a learning module. It tries to find out as much as possible about the currently solved problem by collecting and analyzing the inferences drawn so far.

In all reason maintenance systems memorizing plays a major role. All inferences drawn so far are stored together with their results in a dependency network. No attempt is made to generalize the given information or to remove unimportant nodes. The main problem with this learning method is that memory space grows monotonically, and the management overhead with it. So there is a tradeoff between the gains resulting from learned information and the overhead to maintain the knowledge base. In large search problems, the overall performance of an RMS-based reasoning system typically increases first, then reaches a maximum, and decreases steadily afterwards.

It is obvious that maintaining a cache for all inferences ever performed is inefficient for iterative solvers. To reduce memory consumption for intermediate results, the learning module should abstract from dependencies at the inference layer and focus on the relationship between the assumptions defining the interesting contexts and the final results obtained from them. The ATMS already incorporates an efficient method to compile dependency information given at the inference layer into an explicit representation of the relationship between assumptions and derived data. For each datum, it maintains a label which represents the set of all currently known consistent combinations of assumptions supporting the derivation of the datum. In contrast to the memorizing method used for low-level dependency tracking, the learning method used here incorporates a generalization step. The combinations of supporting assumptions are not enumerated directly but characterized by a lower bound, which is updated after each inference step for all affected data. The label representation can be viewed as a conceptual description of all consistent contexts to which the corresponding datum belongs, and the incremental label update as a special kind of inductive concept learning.
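As an illustration of how a label can characterize all supporting combinations of assumptions by a lower bound, the following Python sketch performs one ATMS-style label update for a single Horn justification. It is a textbook-style simplification under stated assumptions, not the value manager's code: environments are `frozenset`s of assumption names, `nogoods` is a collection of known conflicts, and the function names are hypothetical.

```python
from itertools import product


def minimize(environments):
    """Keep only environments that have no proper subset in the collection."""
    envs = set(environments)
    return {e for e in envs if not any(other < e for other in envs)}


def update_label(old_label, antecedent_labels, nogoods):
    """Label of a consequent after adding one justification.

    Every choice of one environment per antecedent is unioned into a
    candidate; candidates containing a known conflict are dropped, and the
    result is merged with the old label and reduced to its minimal elements.
    """
    candidates = {frozenset().union(*combo)
                  for combo in product(*antecedent_labels)}
    consistent = {e for e in candidates
                  if not any(nogood <= e for nogood in nogoods)}
    return minimize(set(old_label) | consistent)


label_a = {frozenset({"A1"}), frozenset({"A2", "A3"})}
label_b = {frozenset({"A2"})}
new_label = update_label(
    old_label={frozenset({"A2", "A3", "A4"})},
    antecedent_labels=[label_a, label_b],
    nogoods={frozenset({"A1", "A2"})},
)
```

Here the candidate built from A1 and A2 is discarded as inconsistent, and the old environment containing A4 is subsumed by the smaller new one, so only the minimal environment survives.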

Since the label update algorithms do not necessarily require the availability of a complete dependency network, a new class of dependency trackers can be defined which is based only on the second learning method. In the following, we discuss one special instance of this class.

3 The Value Manager

The value manager is a dependency tracking tool which is especially designed to support efficient reasoning about systems at a quantitative physical level. It features fast inference selection and label update during behavior prediction and negligible management overhead for intermediate results. The value manager is based on Horn logic and also shares the label computation algorithms with an ATMS, but does not maintain a dependency network. Instead, it incorporates some new functionality:

• It forgets. By applying data reduction methods, resource consumption with respect to space and computation time is significantly reduced.

• It focuses on one context at a time.

• It distinguishes between short-term and long-term memory management. Separate data buffers give fast access to context-specific data and use a common knowledge base to exchange learned inference results.

The main functional difference w.r.t. a focused ATMS is that the value manager actively controls resource consumption by forgetting unimportant data. The task of selecting data for removal from storage is seen here as a technical resource optimization task like garbage collection, rather than a mission-critical strategic task. Therefore, the control over data reduction is assigned to the value manager and not to the problem solver. This design decision has significant consequences.

3.1 Consequences on the View of Data

A propositional view of data, which is common to all reason maintenance systems, is not sufficient for the value manager. If the value manager is to select data for removal, it must have the necessary knowledge to estimate the importance of data for further reasoning. This includes knowledge about the true meaning of data as well as methods to detect and remove redundancy. Since this knowledge cannot be formalized without assumptions about the inner structure of a datum, confinement to a certain kind of data is the price we have to pay for the extended functionality.

In our application focus, state equations are the atomic pieces of knowledge the problem solver reasons about. State equations are pairs 〈v, d〉 containing a model variable and a (possibly infinite) set of values which is a subset of the corresponding variable domain. They represent propositions about the possible values for a quantity of the physical system under analysis in a certain context. Every state equation with an empty value set denotes falsity. For effective data reduction, the value manager needs at least to be able to compare value sets for the same variable with respect to set inclusion. To reduce the communication overhead, the value manager should also be able to intersect those value sets.
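The two operations required of value sets, inclusion test and intersection, can be sketched in Python as follows. The sketch assumes that each value set is a single closed interval, which is only one possible representation; the class name `StateEquation` and its method names are chosen for this illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StateEquation:
    """Pair <v, d>: a variable and a value set, here one closed interval."""
    variable: str
    lo: float
    hi: float

    def is_falsity(self):
        # An empty value set denotes falsity.
        return self.lo > self.hi

    def is_subset_of(self, other):
        # Value sets are only comparable for the same variable.
        if self.variable != other.variable:
            raise ValueError("different variables")
        return self.is_falsity() or (other.lo <= self.lo and self.hi <= other.hi)

    def intersect(self, other):
        if self.variable != other.variable:
            raise ValueError("different variables")
        return StateEquation(self.variable,
                             max(self.lo, other.lo), min(self.hi, other.hi))


a = StateEquation("y", 3.0, math.inf)
b = StateEquation("y", -math.inf, 10.0)
c = a.intersect(b)                                   # y in [3, 10]
d = c.intersect(StateEquation("y", 12.0, math.inf))  # empty value set
```

In the usage lines, `c` is at least as restrictive as both `a` and `b`, and `d` has an empty value set and therefore denotes falsity.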

While information hiding between problem solver and value manager is not as strict as in the classical approach, there is still some abstraction in the value manager's view of the inference process. For instance, the true meaning of assumptions, which are used to define the contexts of interest for the problem solver, is completely hidden from the value manager. Within a diagnostic problem, different contexts may represent system states of different candidates as well as state snapshots for dynamic systems at different points in time, or a combination of both.


3.2 Consequences on Label Completeness

Whenever a dependency tracking unit decides to remove a datum, it is possible that later on another justification may be found by the problem solver for the same datum. Together with the removed datum, the knowledge about its successors in the dependency network is lost. Consequently, when a previously removed datum gets a new justification, label completeness of its successors is lost, at least as long as the consequences have not been redrawn. Different strategies can be imagined to handle gaps in a dependency network. Their efficiency will strongly depend on the average size and absolute number of gaps. Our data reduction strategies aim at keeping memory consumption constant during iterative pruning loops and will therefore in general lead to rather large gaps. In that case, the usefulness of the dependency network itself can be doubted. Consequently, the value manager does not maintain such a data structure at all. Instead, it merely maintains compiled dependency information with respect to context-defining assumptions in the labels associated with the data. This reduces the overhead to zero when removing redundant or unimportant data. The resources additionally required for redrawing inferences depend on the desired degree of completeness. Completeness with respect to the original problem is not a realistic goal when dealing with numeric constraint networks, as is illustrated in Example 1 below. If we are satisfied with at least one (possibly not minimal) justification per datum, no redrawing is necessary at all during the investigation of a certain context. A slightly more liberal strategy suppresses the redrawing of consequences only for data which are classified as unimportant by the value manager. Independent of the chosen strategy, completeness of labels can no longer be guaranteed at any time, not even with respect to the current focus.

Example 1 Let CN = 〈{x, y}, R², {c1, c2, c3, c4, c5}〉 be a constraint network which contains the following constraints:

c1 : y = x + 1        c4 : y < 10
c2 : y = x ∗ 2        c5 : y < 10000
c3 : y > 3

Let further, for all i ∈ {1, . . . , 5}, Ai denote the assumption that ci holds. The first two constraints obviously have only one solution: x = 1 and y = 2. Therefore, the first three constraints cannot hold at the same time, and consequently, {A1, A2, A3} is a conflict. But this conflict cannot be found by local propagation because the variable domains are not bounded. If we perform domain reduction based on the first three constraints, the lower bounds of the variable domains are pushed up steadily. After more than 2000 constraint evaluations, starting with c3 and then iterating over c1 and c2, it becomes clear that no solution exists within the range of double precision machine numbers. But even this result does not guarantee that there is no solution at all. A fixed point is never reached. Domain splitting does not help to identify the conflict either.

In spite of the high number of potential propagation steps which can be performed to compute reduced domains using different parts of the constraint network, the datum 〈y, ∅〉 denoting falsity can be obtained in fewer than 10 reduction steps by simply focusing the propagation on the smallest value sets known for each variable. Depending on the order in which the constraints are evaluated, the corresponding label will be {A1, A2, A3, A4} or {A1, A2, A3, A4, A5}, respectively. Obviously, neither is a minimal conflict with respect to the original constraint problem.
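The two propagation counts quoted for Example 1 can be reproduced with a short Python sketch of lower-bound propagation. It is an illustration under simplifying assumptions, not the solver used in the paper: only lower bounds are tracked, strict bounds are treated as closed, Python floats stand in for double precision machine numbers, and the function name `propagate` and its status strings are hypothetical.

```python
import math


def propagate(y_upper=math.inf, max_evals=10_000):
    """Push up the lower bounds of x and y using c3, then c1 and c2 in turn.

    y_upper models an optional upper bound on y (c4 or c5).
    Returns (status, number_of_constraint_evaluations).
    """
    x_lo = -math.inf
    y_lo = 3.0                 # c3: y > 3 (treated as a closed bound)
    evals = 1
    if y_upper < math.inf:
        evals += 1             # one evaluation of the upper-bound constraint
    while evals < max_evals:
        x_lo = max(x_lo, y_lo - 1.0)   # c1: y = x + 1, so x >= y_lo - 1
        y_lo = max(y_lo, 2.0 * x_lo)   # c2: y = x * 2, so y >= 2 * x_lo
        evals += 2
        if math.isinf(y_lo):
            return "overflow", evals   # left the range of machine numbers
        if y_lo >= y_upper:
            return "empty", evals      # value set of y is empty: falsity
    return "limit", evals


unbounded = propagate()          # only c1, c2, c3
focused = propagate(y_upper=10)  # c4 included
```

Without an upper bound, the lower bound of y roughly doubles per round and only leaves the floating-point range after more than 2000 evaluations; with c4 included, the value set of y becomes empty after a handful of steps.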

3.3 Focusing and Data Reduction Strategies

For efficient data management, focusing and data reduction techniques have to cooperate in a productive way. While the first technique tries to avoid the generation of unnecessary or unimportant data, the second aims at getting rid of ballast accumulating during the reasoning process. Both strategies must be based on the same criteria of importance with respect to the task at hand. To apply data reduction without focusing does not make sense. In general, it is more efficient to avoid the production of useless data in the first place than to remove it afterwards. So data reduction produces a real benefit only if applied to data which could not be avoided by focusing.

As described in Section 1, the current state of the art in ATMS technologies includes focusing techniques which restrict inference drawing to consequences of interesting combinations of assumptions. Especially when reasoning about numeric values, focusing on contexts of interest is necessary for efficiency, but not sufficient. In Example 1 we have seen that focusing inferences within a single context is required as well.

Publications in the area of model-based reasoning which address the subject of focusing within a single context are rare. However, an equivalent to the idea of focusing inferences on the most restricted values can be found in [Goldstone, 1992]. In that paper, a diagnostic system called Skordos is described. The concepts discussed address the difficulty of managing the tremendous number of possible predictions in quantitative value set propagation algorithms. Intervals of continuous domains are represented by inequalities like x ≤ 7. A process called hibernation delays propagation of some inequalities until they become important for diagnostic reasoning. The importance of data is measured by its usefulness for finding new conflicts within the focused contexts¹. The proposed strategy starts with the computation of consequences only for those inequalities which are the mathematically strongest for a certain variable in at least one focused context. All other consequences are delayed until conflicts have been found.

¹ In fact, [Goldstone, 1992] defines hibernation directly on diagnostic candidates, but replacing the set of candidates by an arbitrary set of focused contexts is a straightforward generalization.

The more carefully the inference process is controlled within the focused contexts, the higher is the overhead for inference selection. For instance, to decide whether to compute consequences for a newly derived datum, [Goldstone, 1992] uses an algorithm which determines for each focused context the corresponding mathematically strongest inequalities. The worst-case complexity of that algorithm is quadratic with respect to the number of focused contexts. To focus inference drawing on those steps whose results are valid in at least one focused context, additional checks on the set of antecedent nodes are necessary. While determining whether a set of data holds in at least one focused context can be organized very efficiently², the costs of testing mathematical strength depend on the kind of data used. If state equations with interval sets as values are used, the costs are in the same order of magnitude as the costs of the inference step itself.

Figure 1: The value manager

To avoid these extra costs, the value manager restricts reasoning to a single focused context at a time. Of course, this increases the number of necessary focus adjustments. Instead of investigating a set of candidates simultaneously, a diagnostic problem solver using a value manager is forced to investigate one candidate after another, state by state. Between every two investigations, the focused context has to be changed. The main reason why this strategy is more efficient for the value manager is the fact that the value manager forgets. As can be seen in Section 4, in relevant applications the amount of data produced during the investigation of a context is orders of magnitude larger than the amount of data the value manager remembers after the investigation. When changing the focusing environment, a search for the mathematically strongest data for the new focus is still necessary. However, the amount of data to be checked is strongly reduced since no comparisons are necessary between intermediate results. A second reason is the explosion of focused contexts, which is caused by special assumptions needed for disjunction encoding (see Section 3.5). Without further focusing within the set of all interesting contexts, the selection of useful inference steps becomes extremely expensive.

The strategy of focusing on the mathematically strongest data defines the importance of data with respect to the currently investigated contexts. For the development of efficient data reduction strategies, this criterion is helpful, but we have to take into account another aspect of importance: the relevance of a datum for further context investigations.

² In [Tatar, 1994] a 2vATMS is described which checks whether a set of data holds within one of the given focused contexts in linear time with respect to set size.

The value manager provides a framework in which memory management is divided into short-term and long-term management. While short-term memory management is based on the former aspect of importance, long-term memory management is based on the latter. Since the value manager does not maintain a dependency network, it can effectively separate the knowledge about the currently investigated context from the knowledge about previously investigated ones. We call the memory for context-specific knowledge the context buffer and the memory for learned knowledge about other contexts the value database. Each memory maintains a set of conflicts, a set of value manager nodes (each comprising a datum, an ATMS-like label, and possibly some other administrative information), and also means to look up nodes for a given variable efficiently (see Figure 1).

Short-term memory management is performed during the investigation of a context. Whenever a new datum together with at least one justification is added to the context buffer, the data reduction strategy decides whether to store it, to forget it, to combine it with one of the currently maintained nodes for the same variable, or to replace some of these nodes by it. This step includes value set manipulations as well as ATMS-like label computations. Several strategies have been tested during the development of our reference implementation. For models with a high proportion of quantitative relations, by far the most efficient one turned out to be a strategy which computes intersections whenever possible and for each variable only keeps the node with the most restricted value set³. The main advantages of this strategy are the fast convergence and the limited memory consumption, which remains constant during propagation. While the value manager is open for strategies which also keep other than the most restricted value sets in memory, it expects all strategies to compute the most restricted value set and provides a special interface method to access the corresponding value manager node.
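The strategy that keeps only the most restricted value set per variable can be sketched in Python as follows. This is a simplified illustration and not the reference implementation: value sets are assumed to be closed intervals, each node carries a single environment instead of a full ATMS-like label, and the class name `ContextBuffer` and its attributes are hypothetical.

```python
import math


class ContextBuffer:
    """Keeps one node (lo, hi, environment) per variable (assumed layout)."""

    def __init__(self):
        self.nodes = {}       # variable -> (lo, hi, frozenset of assumptions)
        self.conflicts = []   # environments of derived empty value sets

    def add(self, variable, lo, hi, env):
        """Add a derived interval; return True if the stored node changed."""
        env = frozenset(env)
        if variable not in self.nodes:
            new = (lo, hi, env)                      # store
        else:
            old_lo, old_hi, old_env = self.nodes[variable]
            if old_lo >= lo and old_hi <= hi:
                return False                         # forget: nothing new
            if lo >= old_lo and hi <= old_hi:
                new = (lo, hi, env)                  # replace by tighter set
            else:
                # combine: the intersection depends on both environments
                new = (max(lo, old_lo), min(hi, old_hi), env | old_env)
        self.nodes[variable] = new
        if new[0] > new[1]:                          # empty set: falsity
            self.conflicts.append(new[2])
        return True


buf = ContextBuffer()
changes = [
    buf.add("y", 3, math.inf, {"A3"}),
    buf.add("y", -math.inf, 10, {"A4"}),
    buf.add("y", 0, 20, {"A5"}),
    buf.add("y", 12, math.inf, {"A1", "A2"}),
]
```

In this run the third datum is forgotten because it is weaker than the stored node, and the last one yields an empty intersection whose environment combines A1 to A4, a conflict that is valid but not minimal, in line with the discussion of Example 1.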

Long-term memory management is performed during context changes. Changing the focused environment of a context buffer includes knowledge transfer between the buffer and the value database in both directions. First, nodes which have been added to the context buffer after the last context change are selected for saving in the long-term memory. Adding clones of those nodes to the value database can include some reorganization, for example removing other nodes which are not necessarily needed any more and are unlikely to be useful in future. Then, the environments in the labels which are maintained by the context buffer are restricted to subsets of the new focused environment. Nodes with an empty label are removed from the buffer. Finally, nodes from the database which are valid in the new focused context are cloned and added to the buffer. This addition adjusts the labels of the nodes to subsets of the new focused environment and makes use of short-term memory management, which can include intersection computations. The selection strategy for data to be stored in the value database, as well as the memory reorganization strategy, are not fixed within the value manager. For the experiments in Section 4, a very simple strategy was used. It adds all final context investigation results to the value database without any reduction on data level.

³ For debugging purposes, it is useful to maintain more than one node in case of conflicting data.
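The three steps of a context change can be sketched as follows. This is a simplified illustration under stated assumptions: nodes are plain `(variable, value, label)` tuples, the database is a list, and the final step omits the short-term reduction (intersection) that the text says is applied when nodes are loaded. The function names are hypothetical.

```python
def restrict(label, focus):
    """Keep only the environments that are subsets of the focused environment."""
    return frozenset(e for e in label if e <= focus)


def change_context(buffer, added_since_last_change, database, new_focus):
    """Return the new buffer content; the database list is extended in place."""
    # Step 1: save nodes added since the last context change (tuples are immutable,
    # so appending them acts as cloning). No reorganization is done in this sketch.
    database.extend(added_since_last_change)

    # Step 2: restrict buffer labels to the new focus and drop empty-label nodes.
    kept = []
    for variable, value, label in buffer:
        restricted = restrict(label, new_focus)
        if restricted:
            kept.append((variable, value, restricted))

    # Step 3: load database nodes that are valid in the new focused context.
    for variable, value, label in database:
        restricted = restrict(label, new_focus)
        node = (variable, value, restricted)
        if restricted and node not in kept:
            kept.append(node)
    return kept
```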

3.4 Supporting Inference Selection

Besides the strategic conflict information to direct the problem solver's search, the value manager also provides tactical support on inference selection level. After adding a newly derived state equation together with the corresponding Horn justification to a context buffer, the need for computing its consequences depends on whether the addition changed the available knowledge about that variable or not. Since the data reduction strategies of the value manager compare the newly added data with existing data for the same variable, the value manager can provide information about relevant value set reductions without extra costs. For that purpose, each context buffer maintains a list called open nodes. When adding new data to a context buffer, all modified and all newly created value manager nodes are added to that list. The problem solver can access that list whenever convenient to select new tasks for the agenda. After each access, the list is automatically cleared. The list of open nodes is also modified by the context buffer when changing the focused context. By tracking relevant changes during the update of the set of all maintained value manager nodes, all nodes which were affected by the context change can be identified. Local propagation benefits from that information, because the number of tasks initially put on the agenda can be reduced.
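A minimal sketch of the open node list, assuming interval value sets and ignoring labels for brevity. The class and method names (`ContextBuffer`, `add`, `take_open_nodes`) are hypothetical and only illustrate the mechanism described above.

```python
class ContextBuffer:
    """Toy context buffer that records which variables gained new knowledge."""

    def __init__(self):
        self.nodes = {}       # variable -> (low, high), most restricted value set
        self.open_nodes = []  # variables whose node was created or modified

    def add(self, variable, low, high):
        old = self.nodes.get(variable)
        if old is not None:
            # Data reduction: intersect with the known value set.
            low, high = max(old[0], low), min(old[1], high)
        if (low, high) != old:
            self.nodes[variable] = (low, high)
            if variable not in self.open_nodes:
                self.open_nodes.append(variable)

    def take_open_nodes(self):
        """Return the open nodes and clear the list, as done after each access."""
        taken, self.open_nodes = self.open_nodes, []
        return taken
```

Because the comparison with existing data is part of the reduction step anyway, detecting relevant changes adds no separate pass over the data.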

It should be emphasized that the open node list mechanism strongly differs from the consumer mechanism used in classical ATMS approaches (see [de Kleer, 1986c]). Both mechanisms have the same goal, namely to avoid unnecessary inferences, but the means are quite different. The consumer mechanism allows the problem solver to attach markers (so-called consumers) to RMS nodes which indicate the consequences that should be computed for the node (and also contain the code to perform the necessary inferences). Propagation is performed by selecting one of those markers, removing it from the corresponding RMS node and performing the corresponding inferences. Advantages of this mechanism are that no inference needs to be drawn twice, even after context changes, and that the responsibility for completeness of the inference control is completely assigned to one component (the RMS). On the other hand, considerable management overhead is generated for inferences with more than one antecedent node. The more the inference process is focused within a focused context, the more useless consumers are created but never removed. For dependency trackers which apply data reduction, the consumer mechanism is even less suited since removing a node with an attached consumer may affect completeness.

3.5 Disjunction Handling

To solve cyclic dependencies in physical systems, inference methods which go beyond local propagation are needed. Current approaches combine different techniques such as local propagation, domain splitting and network decomposition. Since results of domain splitting steps do not have Horn justifications, the value manager has to be extended to support a branch&prune solver as described in [Lunde, 2005].

We first focus on the logical problem of computing sound labels. Let $L(eq)$ denote the label of the state equation $eq$. Labels are sets of sets of assumptions and have the same meaning as within an ATMS. The logical relationship between two split equations $\langle x, d_1\rangle$ and $\langle x, d_2\rangle$ resulting from splitting the value set of a third equation $\langle x, d\rangle$ can be expressed by means of two split assumptions $A_1$ and $A_2$. For both $\langle x, d_i\rangle$, we define

$$L(\langle x, d_i\rangle) = \{\, e \cup \{A_i\} \mid e \in L(\langle x, d\rangle) \,\}.$$

Since both split assumptions are related by disjunction, we can now define split assumption elimination based on hyperresolution. Let $\langle y, d_1\rangle$ be a consequence depending only on the first split assumption $A_1$ and $\langle y, d_2\rangle$ a consequence depending only on $A_2$. A sound, minimal, and consistent label for $\langle y, d_1 \cup d_2\rangle$ is obtained by removing supersets of contained environments and known conflicts from the following label:

$$L = \{\, e \mid (e \in L(\langle y, d_1\rangle) \wedge A_1 \notin e) \;\vee\; (e \in L(\langle y, d_2\rangle) \wedge A_2 \notin e) \;\vee\; (\exists e_1 \in L(\langle y, d_1\rangle)\ \exists e_2 \in L(\langle y, d_2\rangle) : e = (e_1 \setminus \{A_1\}) \cup (e_2 \setminus \{A_2\})) \,\}$$

Since falsity can be expressed by arbitrary state equations which contain an empty value set, this specification also covers conflict handling.
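The split assumption elimination defined above can be sketched in a few lines. This is an illustrative reading of the formula, assuming that environments and conflicts are frozensets of assumption names; the function name `eliminate_split` is hypothetical.

```python
def eliminate_split(label1, label2, a1, a2, nogoods=()):
    """Label for <y, d1 ∪ d2> from the labels of <y, d1> and <y, d2>.

    label1, label2: sets of environments (frozensets of assumptions)
    a1, a2:         the two split assumptions related by disjunction
    nogoods:        known conflicts (frozensets of assumptions)
    """
    envs = set()
    # Environments that never depended on the respective split assumption.
    envs |= {e for e in label1 if a1 not in e}
    envs |= {e for e in label2 if a2 not in e}
    # Hyperresolution: merge one environment per branch, dropping A1 and A2.
    envs |= {(e1 - {a1}) | (e2 - {a2}) for e1 in label1 for e2 in label2}
    # Consistency: remove supersets of known conflicts.
    envs = {e for e in envs if not any(n <= e for n in nogoods)}
    # Minimality: remove supersets of contained environments.
    return {e for e in envs if not any(o < e for o in envs)}
```

For example, labels {{A1, C1}} and {{A2, C2}} resolve to {{C1, C2}}: the result no longer mentions either split assumption.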

The next question is how to control hyperresolution within the value manager. In each recursion level, the branch&prune algorithm investigates the consequences of the split equations $\langle x, d_1\rangle$ and $\langle x, d_2\rangle$ in sequence. Therefore, the problem solver could easily navigate the context buffer through both corresponding extended contexts (each defined by extending the original focused environment by one of the split assumptions). Following this idea, hyperresolution inferences could be realized as an extension of long-term memory management. Unfortunately, this usage of the value manager dramatically increases the number of context changes, and reduces the effectiveness of long-term memory management, since intermediate results now find their way into the value database. The resulting system would spend quite a large amount of time on context changes.

Therefore, the context buffer is extended instead. This extension supports domain splitting within the context buffer and eliminates the need to communicate with the value database until the original context is completely investigated or the problem solver loses interest in it. The chosen solution exploits the depth-first control strategy used by the branch&prune algorithm. Instead of maintaining just one set of value manager nodes and one set of nogoods, the extended context buffer maintains a tree called the context tree, which is composed of context tree nodes. This tree reflects the hierarchical structure of context extensions generated by domain splitting operations. Each context tree node comprises a focused environment, a set of value manager nodes, a set of nogoods, and optionally a split assumption and a split equation. The root node is initialized and marked as current node when changing the context. Child nodes are added to the current tree node whenever split operations are performed. The problem solver gets means to navigate to certain tree nodes and to evaluate their children. Evaluation is based on hyperresolution as described above and includes modification of the content of the current tree node and removal of the evaluated children.

Compared to an extension for general disjunction handling as suggested in [de Kleer, 1986b], the expressive power of the sketched functionality is rather limited. Nevertheless, it is very efficient because no search is necessary to apply hyperresolution, and because it supports removal of assumptions and data which are not needed anymore, without additional costs.

4 Experimental Results

A Java implementation of the presented concepts has been integrated into the commercial model-based engineering tool RODON (see [Lunde et al., 2006]) and tested in various experiments. The results of three of them are summarized in the following. All measurements have been performed on a standard PC with 2.2 GHz and 512 MB RAM.

4.1 Automated FMEA Generation for an Automotive System

The analyzed system of the first experiment comprises the electrical equipment of the right door of a current car series. The most important components are the electronic control unit, the exterior mirror assembly, the door lock assembly, the window pane control motor, the switch assembly and some bulbs. The corresponding Rodelica⁴ model is currently used in a commercial project by a major German car manufacturer to generate decision trees for workshop diagnosis. It comprises 167 subsystems and 580 atomic components, and covers more than 60 fault codes within the electronic control unit. The constraint network is composed of 8863 variables and 7338 constraints.

In this experiment, we focus on fault effect prediction for the operational state 'window pane manually up'. This task includes 288 state investigations; in 265 states, domain splitting is activated, which leads to 1290 investigations of extended contexts. Table 1 summarizes the obtained reduction with respect to needed inference steps and computation time when using the value manager. The efficiency gains are significant even though no use is made of conflicts to direct a search. The comparatively small context navigation time confirms the decision to focus on one context at a time.

⁴ Rodelica is a dialect of Modelica, which is a standardized object-oriented language to describe physical systems in a component-oriented and declarative way (see www.modelica.org). Rodelica differs from Modelica in some details since it uses constraints instead of differential algebraic equations to describe component behavior.

                         computed data    computation time
                         total [#]        ctx-nav [sec]    total [min:sec]
Without value manager    45181934         0                16:55
With value manager       1051622          6.0              1:22

Table 1: Impact of the value manager on simulation performance

Despite the large number of intermediate results, the size of the value database is quite limited at the end of the analysis. Only 23799 value manager nodes are maintained. Justifications are not maintained at all, and the storage consumption of the environments, which directly depends on the maximal size of the assumption database, is also limited. All in all, 1611 assumptions are introduced during the analysis, but thanks to the removal of split assumptions when evaluating extended contexts, the size of the assumption database never exceeds 609. As a consequence, the memory space for label management is reduced by more than 50 percent and the performance of label computations is improved.

4.2 Model-based Diagnosis of a Fly-by-Wire System

The pitch elevator control system [Lunde, 2003] of the next experiments is a typical fly-by-wire system. It consists of an electronic control unit called primary flight control unit (PFCU) which controls the angle of a pitch elevator surface by means of an electro-hydraulic servo valve and a hydraulic cylinder. The top-level layout of the system (see Figure 3) also includes a power supply unit (PSU), three redundant position sensors, and some electrical wires. Figure 2 shows the actuator part of the system in more detail. Here, electrical signals are converted into hydraulic flows and finally into mechanical movements. Besides the three main components, some redundant components have been added to the design, to keep the system in a safe state in case of faults. To control the surface angle, the PFCU compares the actual surface position with the required angle and uses the deviation to adjust the position of the electro-hydraulic servo valve. These adjustments determine the movement of the piston in the cylinder, which finally changes the actual angle of the surface.

Figure 2: The actuator part of the pitch elevator control system

As an example of an interesting diagnostic case, we analyze the observed response of the system to a control command from the cockpit which requires the surface angle to change by 5 degrees. Starting with an initial angle of 0 degrees, a movement in the right direction is observed but, due to a defect, the movement does not stop at the angle of -5 degrees.⁵ We want to know which faults can explain the observed behavior.

Our Rodelica model of the pitch elevator control system is again component-oriented and exactly matches the structure shown in Figures 2 and 3. The system behavior is defined by 621 variables and 501 constraints. The main difference with respect to the automotive system of the first experiment is that this model is dynamic to a great extent. To provide the PFCU with realistic feedback from the controlled components, difference equations are used in several parts of the system, e.g. in the cylinder. They compute Euler steps for the corresponding differential equations, which describe the behavior on physical level. A second difference is that reliable predictions are achievable even without domain splitting. Therefore, the experiments were performed with local propagation only. Because most dynamic state variables are continuous (e.g. the cylinder position), we cannot expect large efficiency gains from reusing results of previous state investigations. But here, the conflict computation contributes to our application, since it is basically a search problem.

In the second experiment, we simulate the response of the system to the cockpit command in nominal mode. For this purpose, a sequence of 30 states is computed in which the initial value ranges of the dynamic state variables of each state are determined by the corresponding predecessor states and the difference equations. After 25 states, the surface angle converges at -5 degrees. Table 2 shows that the value manager still improves the simulation performance, though the gains are not as impressive as in the automotive example. The high absolute number of inferences highlights the importance of data reduction.

                         computed data    computation time
                         total [#]        ctx-nav [msec]   total [msec]
Without value manager    45318            0                1200
With value manager       30022            100              892

Table 2: Impact of the value manager on simulation performance

In the last experiment, we use the GDE-based diagnostic engine of our reference implementation to diagnose the described symptom. For this purpose, we restrict the range of the actual surface position variable in the thirtieth state to the interval [-∞, -6] and start diagnosis on that data. The search space is defined by the 26 component fault mode variables and their possible values. The model contains 50 single faults. The number of double faults is approximately 2400.

⁵ In reality, this wrong behavior is detected by some monitors, and fault compensation functions are activated. But this mechanism is out of scope here.

Figure 3: The pitch elevator control system

During the initial check of the candidate 'system ok', a conflict occurs in state 30 because the actual surface position is predicted to be around -5 degrees, which is out of the specified range. The corresponding conflict is mapped back (see e.g. [Tatar, 1996]) to the initial state. It turns out that nominal mode assumptions of 11 components are involved. This information reduces the search space, because it proves that 15 of the 26 suspicious components cannot explain the symptom, at least not by a single fault. During the diagnostic process, the consistency of 61 candidates is checked with respect to the specified symptom. Conflict back-mapping leads to 11 additional conflicts between the initial fault state assumptions. At the end, the following three minimal candidates remain, which are guaranteed to be the only explanations within the scope of single and double faults.

• LVDT Exc C H disconnected

• actuatorLVDTSensor disconnected

• LVDT Act V1 disconnected & LVDT Act V2 disconnected

Figure 3 shows the top-level view of the pitch elevator control system with the corresponding components highlighted. Our reference implementation needs 26.3 seconds for the computation.

In this diagnosis, the conflict sets computed by the value manager lead to a reduction of the search space size from more than 2000 to just 61 candidates. This result emphasizes that dependency tracking is very useful to solve explanatory problems, even if the underlying model is characterized by a low abstraction level and includes continuous dynamic behavior. The level of label completeness which is provided by the value manager has proven to be adequate for this application.


5 Conclusion

Quantitative reasoning about real physical systems usually leads to a huge amount of intermediate results, which may cause severe complications if the reasoning process is supported by a classical reason maintenance system. As shown in this paper, effective data reduction is crucial for efficient dependency tracking. The presented value manager is designed as a light-weight alternative to an RMS. It completely avoids inference caching and concentrates on ATMS-style label computation.

Although developed for a special model-based analysis tool, the concept of the value manager is rather general. It can be utilized to support any problem solver which reasons about values of variables and provides Horn justifications for all inferred results. The suggested solution for disjunction handling requires the problem solver to evaluate disjunctions in a special depth-first order. It is especially efficient in combination with a solver which is based on domain splitting.

Two applications have been discussed, which confirm the importance of data reduction and demonstrate the efficiency of the value manager. The significant difference in their characteristics also indicates that scalability is necessary for widespread applicability in reliability analysis and diagnosis. The value manager is flexible regarding the data reduction strategies actually used, and thus well prepared for task-specific adaptations. Dependency tracking costs and the benefits obtained for the analysis task at hand can be balanced effectively.

The presentation in this paper focuses on performance with respect to single-processor computers, but special care has been taken to support parallel computing as well. The number of context buffers within a value manager is not limited. A problem solver can benefit from this feature by delegating candidate checking to different threads. Each thread can open its own context buffer to access data. Since the data within the buffers are physically separated from the data of the commonly used value database, synchronization is only needed when changing the context of one of the buffers.

References

[de Kleer and Williams, 1987] J. de Kleer and B. C. Williams. Diagnosing multiple faults. Artificial Intelligence, 32:97–130, 1987.

[de Kleer, 1986a] J. de Kleer. An assumption-based TMS. Artificial Intelligence, 28:127–162, 1986.

[de Kleer, 1986b] J. de Kleer. Extending the ATMS. Artificial Intelligence, 28(2), 1986.

[de Kleer, 1986c] J. de Kleer. Problem solving with the ATMS. Artificial Intelligence, 28(2):197–224, 1986.

[Doyle, 1979] J. Doyle. A truth maintenance system. Artificial Intelligence, 12:231–272, 1979.

[Dressler and Farquhar, 1991] Oskar Dressler and Adam Farquhar. Putting the problem solver back in the driver's seat: Contextual control of the ATMS. In Joao P. Martins and Michael Reinfrank, editors, Truth Maintenance Systems (ECAI-90 Workshop), volume 515 of Lecture Notes in Computer Science, pages 1–16. Springer, 1991.

[Forbus and de Kleer, 1988] K. Forbus and J. de Kleer. Focusing the ATMS. In Proceedings of AAAI'88, pages 193–198. MIT Press, 1988.

[Goldstone, 1992] David Jerald Goldstone. Controlling inequality reasoning in a TMS-based analog diagnosis system. In Readings in model-based diagnosis, pages 206–211. Morgan Kaufmann Publishers Inc., 1992.

[Kelleher and van der Gaag, 1993] Gerry Kelleher and Linda van der Gaag. The lazy RMS: Avoiding work in the ATMS. Computational Intelligence: An International Journal, 9(3):239–253, 1993.

[Lunde et al., 2006] K. Lunde, R. Lunde, and B. Munker. Model-based failure analysis with RODON. In Proceedings of ECAI'06, Italy, 2006. (to appear).

[Lunde, 2003] K. Lunde. Ensuring system safety is more efficient. Aircraft Engineering and Aerospace Technology: An International Journal, 75(5):477–484, 2003. ISSN 0002-2667.

[Lunde, 2005] R. Lunde. Combining domain splitting with network decomposition for application in model-based engineering. In Armin Wolf, Thom Frühwirth, and Marc Meister, editors, 19th Workshop on (Constraint) Logic Programming W(C)LP 2005, number 2005-01 in Ulmer Informatik-Berichte, pages 29–40. University of Ulm, Germany, 2005.

[Tatar, 1994] M. Tatar. Combining the lazy label evaluation with focusing techniques in an ATMS. In Proceedings of ECAI'94, Amsterdam, the Netherlands, 1994.

[Tatar, 1996] M. Tatar. Diagnosis with cascading defects. In Proceedings of ECAI'96, Budapest, Hungary, pages 511–518, 1996.


A Supervision Architecture to Deal with Disruptive Events in UAV Missions

Rachid El Mafkouk (*), Jean-François Gabard, Catherine Tessier (**)

(**) Office National d'Études et de Recherches Aérospatiales (Onera)
Département Commande des Systèmes et Dynamique du Vol (DCSD)
2 avenue Édouard-Belin, 31055 Toulouse cedex 04, [email protected], [email protected]

(*) at Onera-DCSD for a training period April-Sept. 2005

Abstract

This paper presents a generic supervision architecture dedicated to autonomous response to disruptive events for a UAV.

We consider a UAV whose mission may be disrupted by internal or external events (e.g. failures, weather situation, interfering aircraft…), and we make the assumption that the environment is such that the UAV cannot communicate with the ground segment to deal with these events.

The same approach is used for modelling and monitoring the nominal mission and for the design of the reaction and replanning strategies. Special attention is paid to the management of multiple concurrent events, and a classification is proposed according to their impact on the mission, which allows event combining rules to be designed. Two main types of reconfiguration strategies are considered, depending on the seriousness of the disruptive event: those implying an immediate safety reaction before replanning, and those allowing a replanning process to be engaged without preliminary safety procedures.

The architecture is implemented with ProCoSA, an asynchronous Petri net-based tool dedicated to mission monitoring and procedure execution in autonomous systems. Formal verification tools offered by ProCoSA are used to validate the architecture, and scenarios are tested in a simplified simulation environment.

Unlike, e.g., [Hamilton et al., 2001], who designed and tested RECOVERY, a heterogeneous knowledge-based diagnosis method for AUV internal failures, the paper does not address diagnosis in itself, but rather how to use the results issued from the Detection and Isolation functions of the FDIR process for Reconfiguration.

1. Introduction

Onboard decision capabilities allow an uninhabited vehicle to reach mission objectives taking into account disruptive events. Decisional autonomy is necessary when the vehicle manoeuvres in a partially known, dynamic and hostile environment, such that communication with the operator may not be available at all times.

Research on autonomy is done for ground robots, Uninhabited Aerial Vehicles (UAVs), Autonomous Underwater Vehicles (AUVs) and space vehicles. Autonomy is characterised by the level of interaction between the vehicle and the human operator: the higher the level of the operator's decisions, the more autonomous the vehicle is. Between teleoperation (no autonomy) and full autonomy (no operator intervention), there are several ways to allow a system to control its own behaviour during the mission [Clough, 2002].

One way to make the vehicle autonomous is to implement onboard decision capabilities to allow the vehicle to perform the mission even when the initial plan prepared offline is no longer valid. Decision capabilities must be implemented within the closed loop of perception, situation assessment, decision and action, and include autonomous response to disruptive events. The DCI system [Schreckenghost et al., 2005] provides two data monitoring and event detection capabilities: the Event Detection Assistant (EDA) triggers simple conditional events whereas the Complex Event Recognition Architecture (CERA) detects situations consisting of sets of events organised temporally and hierarchically. EDA monitors telemetry data and CERA compares incoming data to pre-defined event conditions such as logical relations on incoming data. Events together with urgency information are then presented to the user. The same kind of event generation is used in [Barbier et al., 2006a] but it is implemented within a UAV onboard architecture including monitoring and replanning tasks in order to avoid systematic return to base and proceed with the mission autonomously given the new constraints.

This paper goes further in so far as the main focus is theprocessing of multiple concurrent events in the autonomousreconfiguration phase.

Let us consider a UAV whose mission may be disrupted by internal or external events (e.g. failures, weather situation, interfering aircraft, threats...). The environment is such that the UAV cannot communicate with the ground segment to deal with disruptive events. The nominal mission of the UAV is defined through a set of operational tasks (i.e. tasks involving payloads) and non-operational tasks (take-off, waypoint rejoining...). Operational tasks are described as sets of legs that include waypoints to be rejoined. The nominal flight plan is described as a list of tasks, legs and waypoints requiring or not the use of one or several payload modes. The nominal mission is represented by a set of ProCoSA Petri nets (see Appendix) allowing the current activities to be monitored on-board.

A classification of the possible disruptive events is given in the second section, as well as a way to deal with several concurrent events. The strategies enabling concurrent disruptive events to be dealt with are described in the third section, and the software environment used for the first simulation tests is presented in the fourth section.

Appendix A gives the main features of ProCoSA, which is used for implementing the supervision architecture, and Appendix B is a short reminder about Petri nets.

2. Disruptive events

A disruptive event is a logical condition on the values of parameters coming from the telemetry frame.

Example: (Frame_OK AND sensor_block OK AND RPM_parameter OK AND RPM_measure < N) OR (Frame_OK AND pilot_block OK AND EngineOff = true) is the event corresponding to an engine failure, with RPM the rotation speed of the engine.
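As an illustration, the engine failure condition can be written as a predicate over a telemetry frame. The sketch below assumes the frame is available as a dictionary keyed by the parameter names of the example; the dictionary layout, the function name and the threshold argument `n_min` (standing for N, whose value is not given) are hypothetical.

```python
def engine_failure(frame, n_min):
    """Evaluate the engine failure event on one telemetry frame (assumed dict layout)."""
    via_sensor = (
        frame["Frame_OK"]
        and frame["sensor_block_OK"]
        and frame["RPM_parameter_OK"]
        and frame["RPM_measure"] < n_min
    )
    via_pilot = frame["Frame_OK"] and frame["pilot_block_OK"] and frame["EngineOff"]
    return bool(via_sensor or via_pilot)
```

Note that both branches require a valid frame: if `Frame_OK` is false, the event cannot be detected at all.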

Many kinds of disruptive events may occur during a mission, and multiple events have to be considered. In order to avoid the combinatorial aspect of multiple events, individual events are classified according to their impacts on the mission: absorbing events (e.g. engine failure), safety-related events (e.g. interfering aircraft), mission-related events (e.g. payload failure), communication-related events (e.g. telemetry failure). This allows event combining rules to be designed, e.g. a catastrophic event masks any other kind of event, the constraints of two (or more) safety-related events are considered together, etc.

2.1. Event classification

The following classification is proposed:

• absorbing events (EA) lead to mission abortion. They cannot be recovered and the reaction amounts to making the UAV land as smoothly as possible. When such an event occurs, the processing of any other kind of event is aborted and no further incoming event can be processed. Example: total engine failure;

• safety-related events (ES) lead to modifying the flight profile or the flight plan (e.g. a temporary route change), which may induce delays or new constraints on the use of the payload. Examples: interfering aircraft, new forbidden area, turbulence;

• mission-related events (EM) only have consequences on the mission itself. Replanning amounts to adapting the mission to the new constraints, e.g. removing waypoints. Examples: camera failure, violated temporal constraint, new mission goal;

• communication-related events (EC) are related to communication breakdowns between the UAV and the ground. Such events result in the UAV being fully "autonomous"; therefore it has to proceed with the mission as planned. Example: telemetry failure.

Remark: the UAV can detect events only if the relevant information is available either from its own sensors or from communication. If a sensor or the communication breaks down, some information is no longer available and consequently some disruptive events may be missed. This would also be the case for an inhabited aircraft.

2.2. Event combining rules

The assumption is made that events occur asynchronously, therefore the occurrence of simultaneous events is not considered. The following rules are set to deal with two successive events, i.e. such that the second event occurs while the first one is dealt with:

• an absorbing event EA has priority over any other type of event;

• a safety-related event ES has priority over a mission-related event EM;

• two successive safety-related events ES1 and ES2 are dealt with within a unique process: the flight plan is updated taking account of the constraints resulting both from ES1 and ES2. Should some constraints be incompatible, the most time-critical or safety-critical are dealt with first;

• two successive mission-related events EM1 and EM2 are dealt with within a unique process: the mission is updated taking account of the remaining available resources, resulting in a degraded mission plan;

• communication-related events EC do not interfere with the other types of events: indeed they are not dealt with explicitly and the UAV goes on with the processing of the other events it can be aware of.

Table 1 sums up the combining rules: considering the type of a first event and the type of a second event occurring while the first one is being dealt with, the result indicates the type of the event that will actually be dealt with. When two ES or EM events occur successively, notations $E^2_S$ or $E^2_M$ mean that the on-going reconfiguration procedure will be rerun on a new set of constraints built from the constraints of both events.

                  first event
second event      EA      ES         EM         EC
EA                EA      EA         EA         EA
ES                EA      $E^2_S$    ES         ES
EM                EA      ES         $E^2_M$    EM
EC                EA      ES         EM         EC

Table 1: event combining rules

As a matter of fact, the rules can be applied recursively to n successive events, as $E^2_S$ are ES-type events and $E^2_M$ are EM-type events. This will be one of the main points featured by the reconfiguration strategy.
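The combining rules and their recursive application can be sketched as a priority comparison. The encoding below is an assumption made for illustration: event types are plain strings, and the combined types are written "E2S" and "E2M"; the function name `combine` is hypothetical.

```python
from functools import reduce

# Priority order implied by the rules: EA > ES > EM > EC.
PRIORITY = {"EA": 3, "ES": 2, "EM": 1, "EC": 0}
# Combined events behave like events of their base type.
BASE = {"E2S": "ES", "E2M": "EM"}


def combine(first, second):
    """Type of the event actually dealt with when `second` occurs during `first`."""
    a, b = BASE.get(first, first), BASE.get(second, second)
    if a == b and a in ("ES", "EM"):
        return "E2" + a[1]  # rerun reconfiguration on both constraint sets
    return first if PRIORITY[a] >= PRIORITY[b] else second


# Recursive application to n successive events is a left fold.
result = reduce(combine, ["EM", "ES", "ES", "EC"])
```

Here a mission-related event is first superseded by a safety-related one, the second safety-related event is merged with it, and the communication-related event does not interfere.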

3. Dealing with disruptive events

The strategy that is designed to deal with disruptive events is a two-step process: (1) a pre-processing of incoming events (applying the combining rules defined previously should successive events occur) and (2) reconfiguration procedures.

3.1. Event pre-processing

A Valid Event Generator (VEG) is designed in order to filter the events coming from the onboard telemetry frame according to the current state of the mission (especially the events that are currently being dealt with) and to the combining rules (see section 2.2). Consequently only valid events (noted VE) are actually dealt with.

In order to deal with multiple successive events and considering the fact that combining rules can be applied recursively, the rank of a valid event is defined as follows:

• a valid rank1-event (VE1) is issued by the VEG when the state of the mission is nominal (there is no on-going reconfiguration process). A rank1-event will trigger a rank1-decision.

• a valid rank2-event (VE2) is issued by the VEG while another event is being dealt with. A rank2-event will trigger a rank2-decision.

Table 2 shows the event that is issued by the VEG according to the event being processed and the incoming event:

                      event being processed
incoming event    VE1A     VE1S     VE1M     VE1C
EA                VE1A     VE2A     VE2A     VE2A
ES                VE1A     VE22S    VE2S     VE2S
EM                VE1A     VE1S     VE22M    VE2M
EC                VE1A     VE1S     VE1M     VE22C

Table 2: events issued by the VEG

Let us explain the first two columns (the explanations are similar for the last two):

• when an EA event is being processed (VE1A), no other event can be dealt with;

• when an ES event is being processed (VE1S): if the incoming event is an EA event, it is dealt with immediately as a rank-2 event (VE2A); if the incoming event is an ES event, both events are dealt with together as a rank-2 event (VE22S); the other incoming events are not dealt with.
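As an illustration, Table 2 can be transcribed as a lookup table. The sketch below is hypothetical (the function name `veg` and the use of `None` for the nominal state are assumptions, not part of ProCoSA); it returns the issued event together with a flag telling whether the incoming event was filtered out.

```python
# Table 2 transcribed: VEG_TABLE[event being processed][incoming event].
VEG_TABLE = {
    "VE1A": {"EA": "VE1A", "ES": "VE1A", "EM": "VE1A", "EC": "VE1A"},
    "VE1S": {"EA": "VE2A", "ES": "VE22S", "EM": "VE1S", "EC": "VE1S"},
    "VE1M": {"EA": "VE2A", "ES": "VE2S", "EM": "VE22M", "EC": "VE1M"},
    "VE1C": {"EA": "VE2A", "ES": "VE2S", "EM": "VE2M", "EC": "VE22C"},
}


def veg(current, incoming):
    """Return (issued event, filtered?).

    `current` is None when the mission is nominal (assumed encoding);
    in that case the incoming event is issued as a rank-1 event.
    """
    if current is None:
        return "VE1" + incoming[1], False
    issued = VEG_TABLE[current][incoming]
    # When the table leaves the processed event unchanged, the incoming
    # event has been filtered and is not dealt with for now.
    return issued, issued == current


print(veg(None, "ES"))
print(veg("VE1S", "ES"))
print(veg("VE1S", "EM"))
```

A caller could keep the filtered events in a list so that they can be reconsidered later, in line with the remark that filtered events are not forgotten.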

Remark: the events that are not dealt with because they are filtered by the VEG are not “forgotten”: depending on the state of the mission, they can be dealt with afterwards. Example: EM event payload failure can be dealt with if the processing of ES event interfering aircraft has led to a new plan that allows payload legs to be performed.

The following ProCoSA Petri net (figure 1) synthesises how multiple events are taken into account:

Figure 1: decision_levels Petri net


The state of the mission is one of the following:

• nominal state;

• on-going reconfiguration in case of a single disruptive event (on-going rank1-decision); if the process is interrupted by a VE2 event, the VE1 event is memorised for possible later processing;

• on-going reconfiguration in case of multiple disruptive events (on-going rank2-decision); the loop on the place associated with this state represents the possible recursion on event combining rules; if the VE1 and VE2 events belong to the same category, a new reconfiguration problem is solved, taking into account both sets of event parameters; if the VE2 event belongs to a higher priority category, the pending VE1 event may be processed later if the state reached by the aircraft after the processing of VE2 allows it.

Remark: considering two decision ranks (and therefore two event ranks) is relevant because:

• dealing with a disruptive event when the mission is nominal and when the situation is already degraded may lead to different decisions;

• for a particular mission, the possibility is offered to “cut the loop” and only allow two successive events to be dealt with (a third one would trigger a Return To Base).

3.2. Reconfiguration procedures

Whatever the decision rank, two reconfiguration strategies are designed so as to cope with (1) disruptive events that call for an immediate reaction to secure the UAV and (2) disruptive events for which such an immediate reaction is not needed:

• Reaction-Then-Replanning (RTR) is triggered for VEA and VES events: such events affect the flight safety and require an immediate reaction, before flight plan replanning.

Examples: for VES event interfering aircraft, the reaction consists in immediately modifying the UAV route according to the rules of the air; the flight plan is replanned afterwards. For VEA events (e.g. total engine failure), the reaction consists in trying to reach the closest landing field; most often the mission is aborted and there is no replanning.

• Reaction-With-Replanning (RWR) is triggered for VEM events: a smooth reaction may be performed (usually a holding pattern) during which a corrected plan is computed.

Example: for VEM event payload failure, the UAV may fly a holding pattern while the legs involving the failed payload are removed from the plan and a new mission plan is computed.

The reaction and replanning computations are based on generic procedures that are implemented as dedicated processes within ProCoSA. The on-line computation of a relevant reaction and a corrected plan amounts to selecting one or several procedures and instantiating them with the current mission and UAV parameters.

The ProCoSA Petri net shown in figure 2 implements the way the current plan is modified by the reactions and replanning that are triggered by the reconfiguration strategies.

Figure 2: Planner Petri net

• a VEA event leads to an emergency landing procedure (the closest landing field is chosen among the possible emergency landing fields) and the mission is aborted: it is an RTR with no replanning phase;

• a VES event triggers an RTR;

• a VEM event triggers an RWR; the execution of the new plan has to be appended smoothly to the end of the holding pattern (place synchro_holding_pattern);

• a VEC event is just memorised and the UAV goes on with its current plan.
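The dispatch performed by the Planner net can be summarised, very roughly, as a mapping from the category of a valid event to a strategy and an ordered list of steps. The step names below are hypothetical labels chosen for illustration; the source describes the strategies but not an API.

```python
# Hypothetical step names; only the RTR/RWR split comes from the text.
STRATEGIES = {
    "VEA": ("RTR", ["emergency_landing", "abort_mission"]),
    "VES": ("RTR", ["immediate_reaction", "replan"]),
    "VEM": ("RWR", ["holding_pattern_while_replanning", "resume_with_new_plan"]),
    "VEC": (None, ["memorise_event", "continue_current_plan"]),
}


def plan_steps(valid_event: str):
    """Map a valid event (e.g. "VE1S", "VE2A", "VE22M") to its strategy.

    The rank is ignored here because both strategies apply
    whatever the decision rank; only the type letter matters.
    """
    category = "VE" + valid_event[-1]
    return STRATEGIES[category]


print(plan_steps("VE22S"))
print(plan_steps("VE1M"))
```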

4. Simulation environment and first results

4.1. Simulation environment

Besides the architecture we have developed for flight testing [Barbier et al., 2006a] that only deals with single disruptive events, a simplified software simulation environment has been built in order to validate the processing of multiple successive events. This environment includes two main components, which are developed as ProCoSA sub-system functions (see Appendix A):

• the Valid Event Generator (VEG) process;

• the simulation process.

The behaviour of the VEG is implemented as described in Table 2. A simplified user interface allows any kind of disruptive event to be entered anytime during the course of the simulated mission. The VEG sends triggering events to the ProCoSA procedure Petri nets dedicated to disruptive event management, like the decision_levels Petri net (figure 1).

The simulation process includes three main functions:

• mission data acquisition;

• nominal mission execution;

• replanning actions.

Two data files are used for the nominal mission:

• the first one is dedicated to the flight plan, which is described as a list of waypoints; each waypoint is a triplet (waypoint type, required fly-over time, payloads to be used); three payloads are considered;

• the second one contains the locations of emergency landing fields that are available for the mission.

As soon as the mission data are read, the mission is executed and monitored. The flight plan is executed sequentially, until it is interrupted by a disruptive event.

Several simplified reactions and replanning actions have been implemented for individual events, among which:

• EA type events: the closest emergency landing field to the current waypoint is selected;

• interfering aircraft (ES type event): the reaction procedure is an immediate trajectory change for collision avoidance; the replanning strategy aims at modifying (or not) the initial flight plan so that the waypoint fly-over times are satisfied;

• partial loss of engine power (ES type event): the replanning function aims at maximising the number of operational tasks to be performed, taking into account the waypoint fly-over times, especially for the FEBA¹ exit point; no immediate safety reaction is needed in this case;

• payload failure (EM type event): the replanning function elaborates a new plan with the remaining payloads.
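Two of these simplified actions are easy to sketch. The code below is a minimal illustration, not the simulator's code: the `position` field is an assumption (the waypoint triplet in the text carries no coordinates), planar Euclidean distance is assumed, and the function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Waypoint:
    kind: str                 # waypoint type
    fly_over_time: float      # required fly-over time
    payloads: frozenset       # payloads to be used at this waypoint
    position: tuple           # assumption: (x, y), not part of the triplet


def closest_landing_field(current: Waypoint, fields: dict) -> str:
    """EA reaction: pick the emergency field nearest to the current waypoint."""
    return min(fields, key=lambda name: math.dist(current.position, fields[name]))


def replan_after_payload_failure(plan: list, failed_payload: str) -> list:
    """EM replanning: drop the waypoints that involve the failed payload."""
    return [wp for wp in plan if failed_payload not in wp.payloads]


plan = [
    Waypoint("navigation", 0.0, frozenset(), (0.0, 0.0)),
    Waypoint("observation", 60.0, frozenset({"camera"}), (10.0, 0.0)),
    Waypoint("observation", 120.0, frozenset({"radar"}), (20.0, 5.0)),
]
fields = {"alpha": (0.0, 3.0), "bravo": (18.0, 5.0)}

print(closest_landing_field(plan[2], fields))
print(len(replan_after_payload_failure(plan, "camera")))
```

The fly-over time constraints are deliberately left out; a replanning function that honours them would need a vehicle speed model that the text does not give.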

4.2. First results

Formal validation

The ProCoSA verification tool (see Appendix A) allows formal properties of the Petri nets to be checked. This tool has been used to check the consistency of the nominal and replanning procedure Petri nets.

Simulations

Simulation runs have enabled the correct behaviour of the replanning functions to be checked, especially in case of multiple successive events. Indeed no pre-planned procedure exists for multiple events and what is checked is that the VEG filters the events correctly and that the computed reaction and replanning are relevant.

¹ Forward Edge of Battle Area

Example 1: two successive interfering aircraft (figure 3). The simulated UAV initiates a first 90º heading modification to the right to avoid the first aircraft and then increases it to avoid the second one. Then the UAV enters the replanning phase.

Figure 3: two interfering aircraft

Indeed both events are ES events and the associated constraints are aggregated. Therefore the second reaction is a global reaction to both events.

Example 2: payload failure then partial engine failure (figure 4). When the first event occurs, the simulated UAV replans its mission taking account of the failed payload (i.e. the waypoints involving this payload are cancelled). When the engine fails (second event), the UAV speed is reduced and the UAV goes on with the mission, cancelling some more waypoints so as to meet the FEBA time constraint.

Figure 4: payload + partial engine failure

[Figure 3 labels: waypoints WP1 to WP5; 1st interfering aircraft; rank-1 reaction; 2nd interfering aircraft; rank-2 reaction; end-replanning]

[Figure 4 labels: waypoints WP1 to WP8; EM; rank-1 decision; ES; rank-2 decision; end-replanning]


The payload failure is an EM event, therefore a new plan is elaborated. The partial engine failure is an ES event, therefore it has priority over the EM event: the VEG issues this event and the corresponding reaction is triggered.

5. Conclusion

This work has focused on a generic approach to deal with multiple successive disruptive events in UAV missions: a classification of events has been given, together with combining rules allowing a generic decision framework to be designed. A major advantage is that only the decision and replanning algorithms for single events are implemented: any event chain can be dealt with through the combining rules, thus avoiding the combinatorial aspects of multiple events.

The decision architecture is implemented with ProCoSA, which allows (1) nominal mission supervision and abnormal situation management to be dealt with within the same framework and (2) a generic supervision architecture to be designed: indeed the architecture is not dedicated to a specific UAV for a specific mission; the classification of disruptive events allows generic reactions and replanning processes to be implemented, and they are coded within independent subsystem software functions. Moreover, the ProCoSA architecture can be implemented straight on-board.

Simulation tests have highlighted the robustness of the decisions, which is due to Petri net modelling and the associated analysis techniques.

On-going work focuses on the following:

• tests with real telemetry data and real time operating systems are conducted to get more realistic simulation conditions;

• more elaborate replanning strategies [Chanthery et al., 2005] are considered to cope with real time requirements and to deal with complex constraints (e.g. new threats, fuel consumption);

• situation assessment procedures including a prediction function are considered to help anticipate the occurrence of disruptive events, especially in the case of multiple events.

Appendix A : ProCoSA

ProCoSA [Barbier et al., 2006b] is a software environment meant for controlling and monitoring highly autonomous systems. System autonomy is usually obtained by putting together various functions, among which:

• data analysis (sensor data, monitoring data, operator’s inputs);

• nominal mission monitoring and control (vehicle and payload control actions);

• decision (management of disruptive events, replanning).

These functions are often developed as separate subsystems and they have to co-operate in order to fulfil the autonomous system behaviour requirements for the specified missions. More precisely, the needs are the following:

• off-line tasks: specification of the co-operation procedures between subsystem software, subsystem coding for embedded operation;

• on-line tasks: procedure monitoring, event monitoring, and management of the dialog with the operator.

ProCoSA includes the following components:

• EdiPet, a graphical interface for Petri nets which is used both by the developer for procedure design and by the operator for execution monitoring (figure 5);

• JdP, the Petri net player, which executes the procedures, fires the event-triggered transitions of the Petri nets and synchronises the activation of the associated sub-system functions; a socket-based communication protocol allows data to be exchanged with external sub-system software;

• Tiny, a Lisp interpreter that is dedicated to distributed embedded applications.

The ProCoSA procedures are modelled with interpreted Petri nets (see Appendix B):

• triggering events are associated with transitions: a validated transition is fired if and only if the associated triggering event occurs;

• triggered actions are also associated with transitions; they consist in messages sent to JdP when the transition is fired, and the possible actions are: Petri net activation requests, sub-system software function activation requests, event generation requests.

Figure 5: EdiPet graphical interface


Timers can be programmed with ProCoSA: a special activation request enables a timer variable to be instantiated, which allows actions with a limited duration to be modelled.

The ProCoSA procedures are used to model the desired behaviours of the autonomous system; the hierarchical modelling features offered by ProCoSA make it possible to structure the whole application in a generic way: at the highest description level, generic behaviours can be described, regardless of the characteristics of a given vehicle; at the lowest level, the procedures specify the sequences of elementary actions to be performed by the vehicle or the payloads; this modular approach enables a quick adaptation to system changes (e.g. taking into account a new payload).

An important feature of ProCoSA lies in the fact that there is no code translation step between the Petri net procedures and their execution: they are directly interpreted by the Petri net player, thus avoiding an additional source of errors.

ProCoSA finally includes a verification tool, which makes use of the Petri net analysis techniques to check that some “good” properties are satisfied by the procedures, both at the single procedure level and at the whole project level (that is to say taking into account inter-net connections); the following properties are checked:

• place safety (not more than one token per Petri net place);

• detection of dead markings (deadlocks);

• detection of cyclic firing sequences (loops).

Appendix B : a Petri net reminder

A Petri net [Murata, 1989] <P, T, F, B> is a bipartite graph with two types of nodes: P is a finite set of places; T is a finite set of transitions. Arcs are directed and represent the forward incidence function F : P × T → ℕ and the backward incidence function B : P × T → ℕ respectively. The marking of a Petri net is defined as a function from P to ℕ: tokens are associated with places. The evolution of tokens within the net follows transition firing rules. Petri nets allow sequencing, parallelism and synchronisation to be easily represented. An interpreted Petri net is such that conditions and events are associated with transitions.
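The firing rule can be made concrete with a small sketch. It assumes that F gives the tokens a transition consumes from each place and B the tokens it produces (the usual reading of forward and backward incidence, stated here as an assumption), and it models the interpreted case by requiring that the triggering event of a transition has occurred. All names are illustrative and unrelated to the JdP player.

```python
def enabled(marking, F, t):
    """t is enabled when every place holds at least F(p, t) tokens."""
    return all(marking.get(p, 0) >= w for (p, tr), w in F.items() if tr == t)


def fire(marking, F, B, t, occurred=frozenset(), triggers=None):
    """Return the marking reached by firing t, or None if t cannot fire.

    `triggers` maps a transition to its triggering event (interpreted net);
    a transition without a trigger fires as soon as it is enabled.
    """
    needed = (triggers or {}).get(t)
    if not enabled(marking, F, t):
        return None
    if needed is not None and needed not in occurred:
        return None
    new_marking = dict(marking)
    for (p, tr), w in F.items():
        if tr == t:
            new_marking[p] = new_marking.get(p, 0) - w  # consume tokens
    for (p, tr), w in B.items():
        if tr == t:
            new_marking[p] = new_marking.get(p, 0) + w  # produce tokens
    return new_marking


# Toy net: a rank-1 valid event moves the mission out of the nominal state.
F = {("nominal", "start_reconf"): 1}
B = {("reconfiguring", "start_reconf"): 1}
triggers = {"start_reconf": "VE1"}
m0 = {"nominal": 1, "reconfiguring": 0}

print(fire(m0, F, B, "start_reconf", occurred={"VE1"}, triggers=triggers))
print(fire(m0, F, B, "start_reconf", occurred=set(), triggers=triggers))
```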

References

[Barbier et al., 2006a] M. Barbier, J.-F. Gabard, J.-H. Llareus, C. Tessier, J. Caron, H. Fortrye, L. Gadeau, G. Peiller. Implementation and flight testing of an onboard architecture for mission supervision. In UAVs 2006, 21st International Conference on Unmanned Air Vehicle Systems, Bristol, UK, April 2006.

[Barbier et al., 2006b] M. Barbier, J.-F. Gabard, D. Vizcaino, O. Bonnet-Torrès. ProCoSA: a software package for autonomous system supervision. In CAR’06, 1st Workshop on Control Architectures of Robots, Montpellier, France, April 2006.

[Chanthery et al., 2005] É. Chanthery, M. Barbier and J.-L. Farges. Planning Algorithms for Autonomous Aerial Vehicle. In 16th IFAC World Congress, Prague, Czech Republic, July 2005.

[Clough, 2002] B.T. Clough. Metrics, Schmetrics! How the heck do you determine a UAV’s autonomy anyway? In Performance Metrics for Intelligent Systems Workshop, Gaithersburg, MD, USA, 2002.

[Hamilton et al., 2001] K. Hamilton, D. Lane, N. Taylor and K. Brown. Fault diagnosis on autonomous robotic vehicles with RECOVERY: an integrated heterogeneous-knowledge approach. In ICRA 2001, IEEE International Conference on Robotics and Automation, Seoul, Korea, 2001.

[Murata, 1989] Tadao Murata. Petri nets: properties, analysis and applications. Proceedings of the IEEE, 77(4): 541-580, April 1989.

[Schreckenghost et al., 2005] D. Schreckenghost, C. Thronesbery and M.B. Hudson. Situation awareness of onboard system autonomy. In i-SAIRAS 2005, 8th International Symposium on Artificial Intelligence, Robotics and Automation in Space, Munich, Germany, Sept. 2005.


Debugging Failures in Web Services Coordination

Wolfgang Mayer and Markus Stumptner

Advanced Computing Research Centre

University of South Australia

[mayer,mst]@cs.unisa.edu.au

Abstract

The rise of Web Services over the past years offers a new development paradigm for distributed applications: high level communication using exchange of structured XML data, using communication protocols orchestrated by workflow languages with complex control constructs. We study the use of model-based techniques that have been used for fault analysis in imperative (Java) and concurrent (VHDL) languages in a Web Service environment, with the goal of diagnosing Web service interactions specified in BPEL4WS, using an Abstract Interpretation approach.

1 Introduction

Web services are currently gaining ground as a new paradigm for distributed applications [Alonso et al., 2004], using Web protocols and XML-based data formats to replace the traditional middleware layer for communication between self-contained externally invokable applications, called services. The XML encoding and the standardised interface definitions such as provided by the Web Service Definition Language (WSDL) [Christensen et al., 2001] facilitate interoperability, making Web services a well suited basis for EAI and cross-company application integration. This has led to the development of service-oriented architectures, where complex applications (business processes) are assumed to be composed from interacting (Web) services.

The development of such service constellations requires specifying and programming the actual interaction patterns (referred to as choreography)¹ between the different services, and dedicated languages have been developed for this purpose, with BPEL4WS [Curbera et al., 2003] and OWL-S [owl, 2004] currently the foremost representatives. The services coordinated in this fashion could themselves be written in these languages, or implemented as traditional applications with a WSDL interface. The fact that these languages provide high level process description constructs (easily mappable to standard process design notations), while their XML-based code and data structures make them amenable to metadata descriptions such as the various Semantic Web service proposals, invites speculation about automated composition and repair, e.g., in the work on self-healing services depicted in [Ardissono et al., 2005], which assumes the ability to diagnose individual services (which may be written in arbitrary languages) and to diagnose their choreography. Taking a slightly different approach, we look at a classical debugging scenario for the choreography itself, and consider the task of diagnosing BPEL choreographies.

¹ Defined as “the ability to compose and describe the relationships between lower-level services. Although differing terminology is used in the industry, such as orchestration, collaboration, coordination, conversations, etc., the terms all share a common characteristic of describing linkages and usage patterns between Web services.” Web Services Choreography Working Group Charter, http://www.w3.org/2003/01/wscwg-charter.html.

2 A Crash Course in BPEL4WS

The definition of BPEL4WS [Curbera et al., 2003] (shortened to BPEL from here) is based on the assumption that to realise the full potential of Web Services as an integration platform, applications and business processes will need a standard process integration model to interact in a principled fashion, and that model will need to support business process executions: potentially long sequences of message exchanges within stateful interactions run by two or more parties. Thus, a need was perceived for a language to specify the patterns of message exchanges and the description of process states (since Web Services, as defined by WSDL, are essentially stateless).

BPEL has been defined to be used in two ways. Either it is used as a specification language for business protocols, which describes patterns of possible message exchanges while abstracting away from specific process details (in particular, aspects that individual companies may want to keep out of the process definition, referred to as “opaque”). In this case, BPEL definitions are nondeterministic (for example, the specific choice between multiple offered selections could not be identified ahead of time; what is specified is that a choice will be made). They could also be understood as constraints on an actual execution. In this fashion, they could be used as a machine readable specification that may be used as additional information in debugging the actual code that implements it (similar to [Stumptner, 2001]). (However, then the question arises why designers would not use other common techniques, such as the various UML diagrams, for specification purposes.)

The second way, which we will focus on here, is to use BPEL as an executable language that describes the actual computations and message contents to be passed on at the coordination level.

The business processes that BPEL defines are supposed to coordinate Web Services that communicate according to the specifications of WSDL [Christensen et al., 2001]. WSDL services define named portTypes (corresponding to interfaces in normal programming languages) that specify individual operations. An operation is a predefined exchange of messages, which can be one-way (receiving), request-response, solicit-response (two-way, depending on which side starts the exchange) and notification (just sending). A service and message name together with a specification of the actual protocol used for sending the message (e.g., SOAP) is called an endpoint.

BPEL builds on this structure by establishing partner links that send messages between service endpoints. A process also possesses variables that can be used to store state data and process history resulting from message exchanges. (WSDL parameters and therefore BPEL variable values are XML structures; the expression language specified to access these structures or parts of them is XPath.)

Example 1. A standard example frequently mentioned in the literature is the “Loan Approval” process (Figure 1). The process involves four processes, each represented as an independent Web Service: the client (Cl), the Risk Assessor (RA), the Loan Approver (LA), and the Financial Institution (FI).

Initially, the client Cl requests a loan of amount monetary units from FI. If the amount is above 10000, the application must be studied in detail and is forwarded to LA for this purpose. Otherwise, if the amount is low, a risk assessment is obtained through the RA service. If the risk is low, approval is automatic. High-risk cases are processed the same way as applications involving large amounts. Once the decision about a request has been made, the Cl is notified.

Provided the application was approved, the client can then withdraw money from the account. In case the amount withdrawn is less than the approved amount, the client is notified, the credit balance is updated and the process ends. Otherwise, an exception is thrown, which subsequently triggers a reply to the Cl. In addition to that, the risk assessment needs to be discarded as the client is no longer trustworthy.
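The approval part of this example can be written down as a short sketch. It is not BPEL and not the authors' code: RA and LA are represented by hypothetical callables standing in for the opaque services, and the boundary case follows the guards shown in Figure 1 (amount < 10000 versus amount >= 10000), which is an assumption since the prose only says "above 10000".

```python
LARGE_AMOUNT = 10000  # threshold taken from the example


def decide_loan(amount, assess_risk, ask_approver):
    """Return True when the loan is approved.

    assess_risk(amount) -> "low" or "high"   (stands in for the RA service)
    ask_approver(amount) -> bool             (stands in for the LA service)
    """
    if amount < LARGE_AMOUNT and assess_risk(amount) == "low":
        return True  # small, low-risk request: automatic approval
    # Large amounts and high-risk cases are both forwarded to LA.
    return ask_approver(amount)


calls = []


def risk_stub(amount):
    calls.append("RA")
    return "low" if amount <= 500 else "high"


def approver_stub(amount):
    calls.append("LA")
    return amount <= 50000


print(decide_loan(100, risk_stub, approver_stub), calls)
```

Note that for a large amount the risk assessor is never consulted, which mirrors the two exclusive branches of the process.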

In the rest of this paper, we are interested in the process description dealing with FI and consider the other processes as opaque processes.

Occurrences of individual actions in a BPEL process are referred to as activities. The basic message passing activities are invoke, receive, and reply, of which the first refers to initiating a one-way or request-response message exchange. Variables are assigned values using the assign activity. Basic activities can be grouped by various control constructs (with the usual semantics): while, sequence, and switch.

[Figure 1 labels: services Client, FI, RA, LA; activities receive, invoke, assign, reply, throw; guards amount<10000, amount>=10000, risk=low, risk=high, accept=yes, wda<=amount, wda>amount; Fault Handler; update balance; sample trace with nodes A to K (request amount: 100, risk low, approved or denied, withdraw: 5000, Client:balance=8000)]

Figure 1: Loan Application Process

The two most important constructs that express concurrency and nondeterminism are flow and pick. The first, unlike a sequence, allows explicit synchronisation arcs to be expressed between the activities included in the flow. (These activities can be basic or themselves be nested.) The pick activity waits for a set of incoming messages; once one has been received it is chosen for processing and the others are ignored.

BPEL also permits the specification of temporal events, e.g., for defining wait periods and timeouts.

2.1 Fault handling

Since BPEL processes represent long-running business activities, classic atomic transaction models were considered inapplicable. Instead, BPEL relies on explicit reporting of faults, either through a set of system exceptions or by using the throw activity, and their handling by dedicated fault handling activities. The set of activities that are terminated by throwing a fault is defined by explicitly grouping activities in user-defined scopes. A scope is simply a group of activities that are considered grouped for fault handling purposes. Scopes can be nested, and fault handlers can throw faults in enclosing scopes. (Conversely, a fault in a given scope A results in the termination of all activities in A and all scopes nested in A.)

A key property for the fault handlers is that they may contain so-called compensation handlers. Transaction effects are not assumed to be automatically rolled back; instead it is the duty of the developers to program compensating actions that restore a correct overall system state, a classic concept from long-running transaction research. An implication of this choice is that compensation handlers are completely application dependent.
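The interplay between scopes, fault handlers and compensation can be illustrated with a deliberately simplified sketch. It is not a BPEL engine: the class `Scope` and its methods are hypothetical, compensations are attached to individual activities rather than to completed nested scopes, and the default behaviour (compensate in reverse order when no handler is given) is modelled as an assumption.

```python
class Scope:
    """Groups activities for fault handling purposes (simplified)."""

    def __init__(self, fault_handler=None):
        self.fault_handler = fault_handler
        self.compensations = []  # undo actions of completed activities

    def run(self, activity, compensation=None):
        result = activity()
        if compensation is not None:
            self.compensations.append(compensation)
        return result

    def compensate(self):
        # Undo completed work, most recent first.
        while self.compensations:
            self.compensations.pop()()

    def execute(self, body):
        try:
            body(self)
        except Exception as fault:
            if self.fault_handler is None:
                self.compensate()  # assumed default: compensate, then rethrow
                raise
            # An explicit handler replaces the default behaviour entirely.
            self.fault_handler(self, fault)


log = []


def body(scope):
    scope.run(lambda: log.append("assess risk"),
              compensation=lambda: log.append("discard assessment"))
    raise ValueError("invalid withdrawal")


# Handler that only replies: the assessment is never discarded.
Scope(fault_handler=lambda s, f: log.append("reply")).execute(body)
print(log)
```

A handler that calls `scope.compensate()` before replying would restore the intended state; forgetting that call is exactly the kind of application-dependent mistake that compensation handlers make possible.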

2.2 Executable vs Non-executable BPEL

BPEL extensions for executable processes include the ability to explicitly terminate the behaviour of a business process instance. It requires the use of input/output variables in message-related activities (they can be abstracted out in non-executable processes). Also added are fault definitions in case an XPath expression on the right side of an assignment selects no or more than one node; in case a variable or correlation is used before it is initialised; in case multiple receive actions are enabled for the same partner link, portType, operation and correlation sets; and finally for multiple outstanding synchronous requests.

Example 2 (cont’d). Modelling the process in Figure 1 in BPEL, the decision of whether to automatically assess an application or to undertake a more detailed analysis can be modelled as a flow, where the choice of which transition to follow is made by transition conditions comparing amount with 10000 (amount<10000 and amount>=10000 in Figure 1). The communication with RA and LA is done using synchronised invoke activities, blocking until the response has arrived. The decision whether to accept or reject the application is stored in a variable accept. The value of the variable is obtained from the response from LA, or, if the automatic assessment predicted a low-risk case, through an explicit assignment. The reply to the client is modelled using reply activities to make sure the client can associate the asynchronous reply with the initial request. In case the withdrawn amount is invalid, an error message is constructed in variable message and is subsequently forwarded to the client.

Note that the BPEL description contains an error: if the withdrawal fails, the risk assessment is not undone. This is because the fault handler does not include the explicit compensate action to trigger the rollback.²

3 Modelling Web Services for Diagnosis

In this work we are primarily concerned with locating faults in business process models and Web service descriptions. In contrast to previous work [Ardissono et al., 2005], our goal is to locate errors in the description of a service coordination rather than to identify a faulty service in a group of interacting processes. While [Ardissono et al., 2005] employ local reasoners to monitor the Web service execution and eliminate impossible explanations, we assume a more centralised approach where a single coordination template is the focus of interest. Information obtained from a failing service execution together with descriptions of the interaction protocols of the peer services involved in the execution allows us to derive possible explanations for the misbehaviour in the specification of the local service.

In principle, the same techniques employed to debug computer programs [Stumptner and Wotawa, 2000; Mayer and Stumptner, 2002] may seem suitable for analysing BPEL descriptions. On deeper analysis, however, it becomes evident that there is a fundamental difference between computer programs and BPEL “programs”: while the effects of arbitrary programs can be derived from the program’s structure and the semantics of the language constructs, actions in BPEL specifications are usually described on a very high level which is not directly amenable to analysis or execution. Consequently, models based solely on the propagation of values between statements are not effective and result in many possible explanations.

² While BPEL provides default fault and compensation handlers that would trigger compensation actions in case the exception escapes unprocessed, the explicitly specified fault handler disables this functionality.

To overcome this limitation, we propose to incorporate information about the messages passed between communicating processes to eliminate explanations that imply lost messages or blocked processes. In addition, concrete values obtained from successful and failing executions of the service are exploited to derive contradictions and focus the search on relevant paths through the process.

3.1 Modelling Constraints

Typically, models created for diagnosis purposes are created statically, considering only the structure of the system and the flow between components. This approach has proved to be successful and often allows for optimising the performance of the diagnosis engine by pre-compiling parts of the model description [Darwiche, 1999; Frohlich and Nejdl, 1997]. This approach has also been applied to computer programs [Stumptner and Wotawa, 2000; Mayer and Stumptner, 2002], but is limited to deterministic execution paths. In particular, the presence of loops requires a meta-layer which dynamically modifies the model to accommodate additional iterations. More importantly, the models assume that the data flow between components can be determined statically, which makes them unsuitable for expressing concurrent executions.

Much effort has been invested in modelling different aspects of concurrent systems as automata or transition systems, applying various forms of state space reductions to keep the models small [Corbett, 1998]. A prerequisite of many reduction techniques is that all possible transitions between locations in the system are known. Unfortunately, the presence of externally triggered termination of processes, as found in BPEL, renders many standard techniques unusable for our purposes.

Prior to describing a model that can be used for debugging BPEL descriptions, we first take a look at the constraints the modelling process must adhere to:

• Business processes and Web services are not purely computational services that can be invoked arbitrarily, but also may have some effects on the real world. For example, a commercial service would charge a fee. Thus, the diagnostic process should not rely on the repeated execution of a service, but exploit the information obtained through a single (or a small set of) executions.

• The assumption that a precise description of the (correct) operation of the service would be available is somewhat unrealistic. While description languages such as BPEL and OWL-S gain popularity, generally only a partial description of message sequences and preconditions is provided; a precise specification at a level that could be used for diagnosis is usually not available (yet). Thus, the diagnosis engine should not assume the presence of specifications beyond what is provided through the execution to be debugged and a specification of the expected results.

• In contrast to many programming languages and diagnostic models, BPEL processes operate on complex messages containing structured but dynamically generated data, such as XML documents. Consequently, building precise models that can not only check but also predict values becomes more difficult in the general case.

• Concurrent execution and termination of processes are essential features of BPEL and must be supported to a certain degree.

3.2 Dynamic Model

To overcome the limitations discussed in the previous section, we propose a dynamic modelling approach where the model is not derived statically, but built by taking the information provided by observations and test cases into account. First, the process description to be diagnosed is modified to reflect the current fault assumptions of the diagnosis engine. Then, an abstract execution engine tries to find a path that (a) is feasible given the observations and (b) leads, under the modified process specification, to a state where all observations are satisfied. The behaviour implied by the process specification needs to be modelled only to the extent where it is consistent with the constraints given by the test case. A conflict is derived when a consistent model cannot be found. Using (incremental) dependency tracking, a conflict set can be derived while the infeasible trace is constructed.

Definition 1 (Debugging Problem) A BPEL debugging problem DP is a tuple ⟨BP, TC, COMP⟩ where BP denotes the BPEL specification to be debugged, TC denotes a set of test case specifications, and COMP represents the set of diagnosis components in terms of elements of BP.

BP can be seen as a template describing all possible executions, where the elements that appear in COMP may be substituted with “holes” representing arbitrary behaviour, corresponding to fault assumptions made by the diagnosis engine. Typically, elements of COMP would be activities, transition conditions, or even entire scopes. For our purpose, “arbitrary behaviour” is defined as assigning unspecified values to visible variables³, continuing normally, sending or receiving messages, or throwing an exception. TC describes the state of the environment and the inputs provided to BP, together with the expected result and any other constraints that every valid execution must satisfy. In the following we limit our attention to test specifications which fail at runtime. Passing tests can be integrated into this framework by altering the likelihood of certain faults, given the observed correct runs [Jones et al., 2002].

3.3 Activation Model

To simplify modelling, we abstract from the concrete syntax of BPEL and represent the process specification as nested graphs. The core of BPEL is built around the notion of execution scopes. A scope contains a process specification, together with a fault handler and a compensation handler. In the simplest form, a scope contains a single activity. Dependencies between scopes are modelled through links, which define a partial execution order. The execution of a scope can be interrupted either internally, through an uncaught exception in one of the nested scopes, or externally, when a nested scope is terminated due to an exception in an enclosing scope. When a scope S is terminated, all scopes enclosed within S are terminated immediately, possibly executing fault and compensation handlers. Fault handlers are also represented as activities in a scope, which become active as soon as an activity in the scope raises an exception. Compensation handlers are embedded within the enclosing scope’s fault handler.

³ Each variable is visible in the scope where it is defined and in all nested scopes.

Definition 2 (Scope) A scope S is a tuple ⟨N, O, V, F, C⟩ where N denotes a set of actions and nested scopes directly contained in S, O ⊆ N × N denotes the partial execution order between elements in N, V denotes a set of variables visible in S and all contained elements, F the fault handler, and C the compensation handler. For every nested scope S′ ∈ N, S is the parent of S′.

For simplicity, we assume that there is only one fault handler in each scope. Note that our notion of “scope” does not totally agree with the standard BPEL definition.

To keep our model elements simple, we model join conditions of activities and transition conditions as separate actions j without side-effects, apart from determining the activation of the component for which j is specified. In the following, we assume that this transformation has been applied and omit conditions from actions and links. O specifies a partial execution order in the sense that s may be considered for execution only if, for all links si → s ∈ F, the status of all si has been determined (see below).

For each scope S, we generate a number of artificial variables in the enclosing scope that represent the current status of S (+ denotes “yes”, − denotes “no”):

• Sactive=+ if the scope is ready for execution, − otherwise.

• Sdone=+ if S has finished executing, − otherwise.

• Sabort=+ if the scope containing S forces S to abort (due to a failure in a sibling or parent of S), − otherwise.

• Sexc=+ if S has thrown an exception, − otherwise.

Sactive and Sabort must be defined before S is executed, while Sdone and Sexc are defined only after S has completed execution. The reason for having two separate variables Sactive and Sabort is that the execution engine must be able to distinguish the case where S was interrupted while executing from the case where S was never executed because of earlier termination of the enclosing scope. Simply setting Sactive=− would not work because that would lead to a spurious contradiction where S had already been active.
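The scope tuple of Definition 2 and its four status variables can be sketched as a small data structure. This is only an illustration: the field names, the use of `None` for "not yet defined", and the helper `describe` are assumptions made here, not part of the described model.

```python
from dataclasses import dataclass, field
from typing import Optional

PLUS, MINUS = "+", "-"


@dataclass
class Scope:
    """Scope <N, O, V, F, C> as in Definition 2 (field names are illustrative)."""
    name: str
    nested: list = field(default_factory=list)       # N: actions and nested scopes
    order: set = field(default_factory=set)          # O: pairs (before, after) of names in N
    variables: set = field(default_factory=set)      # V: variables visible in the scope
    fault_handler: Optional["Scope"] = None          # F
    compensation_handler: Optional["Scope"] = None   # C
    parent: Optional["Scope"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Scope") -> "Scope":
        """Register `child` as directly contained in this scope."""
        child.parent = self
        self.nested.append(child)
        return child


def status_variables(scope: Scope) -> dict:
    """Artificial status variables for `scope`; None means 'not yet defined'."""
    return {f"{scope.name}.{kind}": None for kind in ("active", "abort", "done", "exc")}


def describe(active: str, abort: str) -> str:
    """Tell apart the two cases that a single activation flag would conflate."""
    if active == MINUS:
        return "never executed"
    if abort == PLUS:
        return "interrupted while executing"
    return "executing normally"
```

Keeping `active` and `abort` separate is what lets `describe` report an interrupted scope without retracting the fact that it had been active.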

A snapshot E of the current state of execution of BP at any time can be obtained by capturing the values of all variables in all active scopes. Whenever a scope S is scheduled for execution, the variables defined in S are instantiated in E (with the initial value “uninitialised”) and removed again after S has completed. Activation variables S∗ are handled specially in that Sactive=S′active where S′ is the parent scope of S. The activation variables associated with each scope identify the current progress of execution. This corresponds roughly to a variable environment found in traditional programming languages. The difference is that here the program counter found in sequential execution is encoded in the S∗ variables.

As mentioned previously, a conflict has been derived if no feasible execution satisfying all constraints in BP ∪ TC can be found:

Definition 3 (Conflict) A set {c1, . . . , ck} ⊆ COMP is a conflict set for DP if

$$AB_{c_1} \circ \cdots \circ AB_{c_k}(BP) \models \bot,$$

where ⊥ denotes impossible behaviour and the $AB_{c_i}$ denote mutation functions which modify BP to reflect the abnormality of the source expressions in BP corresponding to $c_i$. ∘ denotes function composition.

In contrast to conventional execution, the presence of fault assumptions precludes us from assuming that every execution is deterministic. Instead, the execution engine must be able to follow multiple paths even if every concrete execution of BP was deterministic. For example, it may be necessary to follow multiple paths in a flow construct even in case the transition expressions are complementary, simply because with some variable values unknown, the expressions do not evaluate to a unique value. Another example arises if the receive activity in our running example is assumed abnormal. Then, the value of amount is undefined and the abstract execution engine must analyse both paths.

The basic debugging algorithm is outlined as follows: the model is simulated using the test specifications in TC and conflicts are computed from failing test runs. From conflicts, possible explanations are derived, each corresponding to an altered version of the BPEL process, where the precise behaviour of the process is replaced with a loosely constrained one. Each of the candidates is subsequently simulated again to eliminate impossible explanations.
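The outline above can be sketched as follows. The text does not say how explanations are derived from conflicts; the sketch assumes the usual model-based diagnosis step of computing minimal hitting sets of the conflict sets, here by brute force. The function names and the `is_consistent` callback (standing in for re-simulation of a candidate) are hypothetical.

```python
from itertools import combinations


def minimal_hitting_sets(conflicts, components, max_size=2):
    """Enumerate subset-minimal sets of components that intersect every conflict.

    Assumption: candidates are derived as hitting sets, tried in order of size.
    """
    found = []
    for size in range(1, max_size + 1):
        for candidate in combinations(sorted(components), size):
            chosen = set(candidate)
            hits_all = all(chosen & conflict for conflict in conflicts)
            is_minimal = not any(smaller <= chosen for smaller in found)
            if hits_all and is_minimal:
                found.append(chosen)
    return found


def diagnose(conflicts, components, is_consistent, max_size=2):
    """Keep only the candidates that survive re-simulation (the callback)."""
    candidates = minimal_hitting_sets(conflicts, components, max_size)
    return [candidate for candidate in candidates if is_consistent(candidate)]
```

For instance, with conflicts `{A, B}` and `{B, C}` over components `A`, `B`, `C`, the candidates are `{B}` and `{A, C}`; a re-simulation callback that rejects any candidate containing `B` leaves only `{A, C}`.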

3.4 Data Model

To be able to carry some information even in the case where precise values cannot be derived, the Abstract Interpretation [Cousot and Cousot, 1977] framework provides us with the means to predict values. Abstract Interpretation works by substituting the precise effects of each language element with an approximation thereof, which operates over a (finite) abstract domain AD. Thus, an approximation of the true behaviour can be computed in a finite amount of time. For our purposes, a simple abstraction which either predicts a concrete value or does not predict any value for a BPEL variable is sufficient. For the status variables S∗, a power set (or interval abstraction) is used. This does not impede efficiency, as the domain of these variables is small.

The behaviour of a simple scope S containing only a normal action A reflects the effects as defined in BP and the BPEL language specification, operating on the abstract domain AD. For example, the expression amount < 10000 evaluates to true or false in a snapshot E if E(amount) is a concrete value. Otherwise, the result is also undefined.
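A minimal sketch of this two-level abstraction, assuming a snapshot is a plain dictionary and that `None` stands for "no value predicted" (both are choices made for this illustration):

```python
UNKNOWN = None  # assumption: None represents "no value predicted"


def abstract_less_than(snapshot, variable, bound):
    """Evaluate `variable < bound`; the result is unknown if the value is unknown."""
    value = snapshot.get(variable, UNKNOWN)
    if value is UNKNOWN:
        return UNKNOWN
    return value < bound


def feasible_branches(snapshot, variable, bound):
    """Branches the abstract execution engine has to follow for this condition."""
    result = abstract_less_than(snapshot, variable, bound)
    return {True, False} if result is UNKNOWN else {result}
```

With a concrete `amount` of 100 only the true branch remains; with an unknown `amount` both branches must be analysed, which is the source of the nondeterminism discussed earlier.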

In contrast to the simulation of programming languages like Java, the descriptions of the BPEL activities are not detailed enough to actually predict new values given the input values. Instead, we must rely on the values derived through the actual execution TE of the test case. However, those values are only guaranteed to be valid in executions where the involved variables Vi and peer processes Pi exhibit the same state as in TE. Otherwise, no values can be predicted. If this guarantee is available, the values of Vi can be directly compared with the value in E. For every peer Pi, it is necessary to track the messages sent and received to and from Pi to ensure the results are the same as the ones obtained in TE. This can be done by introducing an additional variable MSi for each Pi which acts as an index into the message sequence $MS_i^{TE}$ that was obtained from TE. The value of MSi is updated whenever a send or receive action involving Pi is executed. MSi is incremented if the sent or received message matches the next one in $MS_i^{TE}$, and becomes undefined otherwise.

While this value predictor is quite weak, it can still be effective for limiting the search space of feasible executions in case the fault assumptions are near the end of the execution or if the messages passed between BP and Pi are independent from messages to another Pk.
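The index update for MSi can be sketched as below. Messages are modelled as plain strings and the undefined index as `None`; both are assumptions of this illustration, as is the rule that an undefined index stays undefined.

```python
UNDEFINED = None  # assumption: None represents an undefined index


def advance_index(index, recorded, message):
    """Update the index into the recorded message sequence of one peer.

    The index is incremented if `message` matches the next recorded message
    and becomes (or stays) undefined otherwise.
    """
    if index is UNDEFINED:
        return UNDEFINED
    if index < len(recorded) and recorded[index] == message:
        return index + 1
    return UNDEFINED


def values_predictable(index):
    """Recorded values may only be reused while the index is still defined."""
    return index is not UNDEFINED
```

As long as the index is defined, the abstract execution has exchanged exactly the same messages with the peer as the recorded run, so recorded values can be reused.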

3.5 Complex Behaviour

The forward behaviour of an individual action A is formalised as transfer functions, taking a process snapshot E as input and computing a set of new process snapshots $\mathcal{E}$ as result, which reflect possible effects of A in E. Typically, $\mathcal{E}$ contains a single snapshot, except for actions which may trigger exceptions. In this case, $\mathcal{E}$ contains two snapshots: one reflecting the normal path, while the other corresponds to BP following the exception path. In case A is assumed abnormal, a generic snapshot containing unspecified values for the set VA of BPEL variables accessible at A is returned, as well as an exception snapshot.

For example, consider the assign X:=Y+1 activity⁴ (abbreviated “A”):

$$
[[A]](E) =
\begin{cases}
\{E[X \leftarrow E(Y)+1,\ A_{done} \leftarrow +]\} & \text{if } E \neq \bot \wedge + \in A_{active} \wedge + \notin A_{abort} \wedge + \notin A_{done} \wedge \neg AB(A) \\
\{E[V_A \leftarrow \top],\ E[S_{exc} \leftarrow \top,\ S_{abort} \leftarrow +]\} & \text{if } E \neq \bot \wedge AB(A) \\
\{E\} & \text{otherwise}
\end{cases}
$$

where E[X←E(Y)+1] denotes the substitution of a new value for X in E and VA denotes the set of all variables accessible at A. ⊤ denotes the unknown value. The first clause specifies the normal behaviour if A is active and ready for execution, the second clause specifies the abnormal behaviour. The last clause catches all cases where either A is not ready for execution, or E is infeasible (i.e. that path can never be realised).
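The three clauses of this transfer function can be written down as a small sketch. The representation is assumed for illustration only: a snapshot is a dictionary, `None` plays the role of the infeasible snapshot ⊥, the string `"TOP"` plays the role of ⊤, status variables hold sets of possible values, and the set of variables accessible at A is taken to be just X and Y.

```python
TOP = "TOP"       # assumption: stands for the unknown value
INFEASIBLE = None  # assumption: stands for the infeasible snapshot


def assign_transfer(E, abnormal=False, accessible=("X", "Y")):
    """Transfer function for the activity A: X := Y + 1, returning successor snapshots."""
    if E is INFEASIBLE:
        return [E]  # last clause: infeasible paths stay infeasible
    if abnormal:
        # Second clause: arbitrary values for all accessible variables,
        # plus a separate snapshot for the exception path.
        havoc = dict(E)
        for variable in accessible:
            havoc[variable] = TOP
        exception = dict(E)
        exception["S.exc"] = TOP
        exception["S.abort"] = {"+"}
        return [havoc, exception]
    ready = ("+" in E["A.active"]
             and "+" not in E["A.abort"]
             and "+" not in E["A.done"])
    if ready:
        # First clause: normal behaviour; an unknown Y yields an unknown X.
        result = dict(E)
        result["X"] = TOP if E["Y"] == TOP else E["Y"] + 1
        result["A.done"] = {"+"}
        return [result]
    return [E]  # last clause: A is not ready for execution
```

Returning a list of snapshots in every case keeps the normal, abnormal and blocked clauses uniform for a caller that merges results from several activities.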

Similar to [[A]](E), the abstract effects of each activity in BP can be derived from the BPEL semantics and the structure of BP.

Each scope S in BP must respect flow constraints between its activation variables S∗: S can complete normally iff S was active and S was not aborted (internally or from the outside).

$$
\begin{aligned}
S_{active} = - \ \vee\ S_{abort} = + \ \vee\ S_{exc} \neq - &\iff S_{done} = - \\
S_{active} = + \ \wedge\ S_{abort} = - \ \wedge\ S_{exc} = - &\iff S_{done} = +
\end{aligned}
$$
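For concrete status values, the two equivalences can be checked directly. The following sketch assumes each status variable holds a single value (`"+"`, `"-"`, or anything else for "not determined"); the function name is hypothetical.

```python
def respects_flow_constraints(active, abort, exc, done):
    """Check both equivalences between the status variables of one scope."""
    blocked = active == "-" or abort == "+" or exc != "-"
    normal = active == "+" and abort == "-" and exc == "-"
    return (blocked == (done == "-")) and (normal == (done == "+"))
```

A scope that was active, not aborted and raised no exception must therefore report `done == "+"`; any other combination of `active`, `abort` and `exc` forces `done == "-"`.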

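An activity may be activated if the status of all preceding activities in the partial execution order O has been determined and at least one of the activities has completed normally:

$$
S_{active} =
\begin{cases}
S'_{active} & \text{if } \nexists \langle S'', S \rangle \in F \text{ and } S' \text{ is the parent of } S \\
+ & \text{if } \forall \langle S', S \rangle \in F:\ S'_{done} \neq \text{uninitialised} \ \wedge\ \exists \langle S', S \rangle \in F:\ S'_{done} = + \\
- & \text{if } \forall \langle S', S \rangle \in F:\ S'_{done} = - \ \wedge\ \exists \langle S', S \rangle \in F \\
\text{uninitialised} & \text{otherwise}
\end{cases}
$$

⁴ For brevity, we abstract from the concrete XML and XPath representation of the actual BPEL description.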
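The activation rule for a scope S given above can be sketched as a function of the parent's activation status and the completion status of the scopes linked to S. The string constant for the undetermined status and the list-based interface are assumptions of this illustration.

```python
UNINITIALISED = "uninitialised"  # assumption: marker for an undetermined status


def activation(parent_active, predecessors_done):
    """Compute S_active from the S'_done values of all scopes linked to S."""
    if not predecessors_done:
        # No incoming link: the scope inherits the activation of its parent.
        return parent_active
    all_determined = all(d != UNINITIALISED for d in predecessors_done)
    if all_determined and any(d == "+" for d in predecessors_done):
        return "+"
    if all(d == "-" for d in predecessors_done):
        return "-"
    return UNINITIALISED
```

A scope with one finished and one still undetermined predecessor stays undetermined, which is what delays its execution until all incoming links have been evaluated.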


Some BPEL constructs may enforce additional constraints. For example, the pick action enforces that only one of the successor transitions may be active at any given time. In practice, the values of some of the S∗ variables may not be known precisely and over-approximation must be performed. (This is the reason for the rather unusual notation + ∉ Aabort above: Aabort may have a value v ⊆ {+, −, uninitialised}.)

To compute the effects of a scope S consisting of multiple activities, a worklist algorithm is applied to compute the possible result process snapshots. Starting with the initial process snapshot E, all nested scopes S′ ∈ S are chosen such that + ∈ E(S′active), and new snapshots $\mathcal{E} = \bigcup_{S'} [[S']](E)$ are computed for all S′. All E′ ∈ $\mathcal{E}$ replace E in the worklist. For cyclic control flow, for example loops, a fixpoint algorithm is applied. This is guaranteed to terminate, as the abstract domain for each variable is finite.
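A minimal sketch of such a worklist computation follows. It assumes snapshots are dictionaries with hashable values, and that the activation test and the transfer functions are supplied by the caller (`is_active` and `transfer` are hypothetical callbacks). Remembering already visited snapshots is the simple fixpoint device used here to guarantee termination over a finite domain.

```python
def freeze(snapshot):
    """Hashable key for a snapshot (assumes hashable values)."""
    return tuple(sorted(snapshot.items()))


def run_scope(initial, children, is_active, transfer):
    """Worklist computation of the result snapshots of a scope.

    children:  names of the nested scopes or activities
    is_active: is_active(snapshot, child) -> bool
    transfer:  transfer(child, snapshot) -> list of successor snapshots
    """
    worklist = [initial]
    visited = {freeze(initial)}
    results = []
    while worklist:
        snapshot = worklist.pop()
        active = [child for child in children if is_active(snapshot, child)]
        if not active:
            results.append(snapshot)  # nothing left to execute
            continue
        for child in active:
            for successor in transfer(child, snapshot):
                key = freeze(successor)
                if key not in visited:  # each snapshot is expanded only once
                    visited.add(key)
                    worklist.append(successor)
    return results
```

For a scope with two activities executed in sequence, the worklist holds one snapshot at a time and the result is the single snapshot reached after both have run; with abnormal or concurrent activities several snapshots are produced and explored.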

To account for external terminate events, the snapshots obtained after each application of [[S′]] must be combined to simulate termination of the scope at any point in the execution. This provides a safe, but coarse over-approximation of the true behaviour. It is refined later if it is discovered that the scope cannot be interrupted externally. To improve precision slightly, snapshots are grouped according to the values of Sabort and Sexc, keeping the paths that terminate early separate from those that complete normally.

The purpose of the initial forward analysis is to determine if each scope may be terminated either internally or externally. The model is then refined in subsequent passes to take that information into account and eliminate spurious paths. If an internal terminate event T, such as a throw activity, is encountered, S′abort is set to + for all scopes S′ with S′done≠+ that do not strictly precede T in the partial execution order O.

Once the scope S has been analysed, Sdone is set to + in the snapshot corresponding to the normal completion of S, and to − in the exception case. The snapshots corresponding to normal completion and exception exits are then propagated to the parent of S for further processing. In case the propagated snapshot has Sexc set, the fault handler becomes active. In case the exception is not re-thrown to an outer scope, Sexc and Sdone are both set to −.

3.6 Observations

To incorporate observations about the state of processes in BP or TC, the values of observed variables in snapshots corresponding to the observation location are intersected with the observed values, potentially leading to a conflict. Execution paths not satisfying these values are eliminated from the model by applying a similar procedure as described in section 3.5 in backward direction, intersecting the snapshots obtained through backward reasoning with the values derived from forward reasoning [Bourdoncle, 1993]. In case ⊥ is derived, the entire path is eliminated from the model and the analysis resumes with a different branch, until no more branches can be eliminated. A conflict has been derived if there is no feasible execution from the start node in the model to the final state specified in TC. The forward and backward analyses are repeated until a fixed-point (or a specified limit of iterations) has been reached.
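The intersection step can be illustrated with a short sketch. It assumes that each variable in a snapshot holds a set of possible values and that `None` represents the infeasible snapshot ⊥; the function name is hypothetical.

```python
def observe(snapshot, variable, observed):
    """Intersect the predicted values of `variable` with the observed values.

    Returns the narrowed snapshot, or None if the prediction contradicts the
    observation (the path is then infeasible).
    """
    narrowed = snapshot[variable] & observed
    if not narrowed:
        return None
    result = dict(snapshot)
    result[variable] = narrowed
    return result
```

If every path through the model ends in `None` at some observation point, no feasible execution exists and a conflict has been found.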

After the backward phase, the subsequent forward analysis may transform the model to obtain a more precise abstraction.⁵

• In case Sabort=− for a scope S, it is known that no external termination event is received. Therefore, the coarse approximation of combining the execution snapshots after every activity is carried out is not necessary, leading to a better approximation.

• Similarly, if Sexc=−, the fault handlers of S′ containing S need not be considered.

• For cyclic control structures, if the snapshot before the cycle is inconsistent with the snapshot after the cycle, the cycle must execute at least once to obtain a consistent path. This unrolling can be done to adaptively refine a model and either improve the approximation of the true behaviour, or eliminate the entire path.

• Values propagated from the observations may contradict certain concurrent execution schedules for interfering components. The corresponding join operations of snapshots can then be eliminated.

3.7 Communication Requirements

So far, only the values monitored during the execution of TC and the observations specified in TC have been considered for modelling and diagnosis. With the weak prediction scheme described in the previous section, this is not enough to obtain a sufficiently small number of diagnosis candidates.

To eliminate spurious candidates, we make use of interactions recorded through the initial execution of TC and/or specified in TC. For our purposes, we assume that the messages passed between BP and a peer process Pi are known and their sequence $MS_i^{TE}$ is fixed and correct. While this assumption may seem strong for general processes, in the context of a single BP run it seems to be reasonable if it is assumed that the peer processes would reply with an error message in case their protocol is not followed properly. As mentioned before, we assume that a newly developed process description can rely on the other descriptions to be correct, which is not unreasonable if it is considered that only services that have proved to be reliable would be used to develop a new process.

In addition to tracking values, we introduce separate variables seeni to keep track of the progress of sent and received messages for each process Pi. As described previously, the progress can be represented as an index (or as an interval in case the value is not known precisely) in the message sequence for pre-specified sequences in TC. For processes where message sequences are not specified, we assume that for every message that is sent there must be a matching receiving process, and vice versa. Thus, no message should be lost, nor should any process be caught in a waiting state forever.⁶

Keeping track of the messages allows us to exclude paths and concurrent execution schedules that would not satisfy the observed message sequences from our analysis and prune the search space early. This is possible in particular if the interaction protocols include different method types in subsequent steps, as misaligned send or receive operations can be detected early rather than at the end of the analysis.

⁵ With the exception of the concurrency-related improvements, the transformations are essentially those described in [Mayer and Stumptner, 2004].

⁶ Extensions incorporating timeouts and timed alarms are left for future work.
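The pruning effect can be sketched with a hypothetical peer whose recorded protocol, seen from BP, alternates between receive and send operations. Instead of an interval, the sketch keeps the set of possible progress values, which is an assumption made here for simplicity; an empty set means the path is infeasible.

```python
def advance(possible, kinds, kind):
    """Advance the progress counter of one peer by one operation of type `kind`.

    possible: set of possible numbers of messages already processed
    kinds:    recorded sequence of operation types, e.g. "receive" or "send"
    """
    return {i + 1 for i in possible if i < len(kinds) and kinds[i] == kind}


# Hypothetical four-message client protocol, seen from the business process.
CLIENT = ["receive", "send", "receive", "send"]
```

Starting from the imprecise progress `{1, 2, 3, 4}`, a send can only be the second or fourth message, giving `{2, 4}`; a following receive can only be the third message, giving `{3}`. A send attempted after all four messages yields the empty set and the path is discarded.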

3.8 Diagnosing an Example

Continuing the Loan Assessment example, assume we have a test case that specifies the following sequence of operations for a hypothetical client:

1. A loan application for $100 is requested.
2. The answer is received that the application was approved.
3. The client requests to withdraw $5000.
4. The request is denied.

Assume further that the test case also expects the risk assessment to be cancelled because of the failed withdrawal attempt. For this purpose, RA provides an optional WSDL port that invokes RA’s compensation handler. Further, we know that the interfaces of LA and RA differ, i.e. the WSDL port types are different.

We also observe that the loan assessor is not used. The set COMP contains all activities A, . . . , K in BP. In the following we do not follow the exact forward and backward reasoning procedure, but short-cut the computation as suits the illustration.

When the test case is run, we observe that the client receives the expected messages, but the credit assessment is not rolled back (RA::risk ≠ uninitialised). Therefore, we find that the execution does not satisfy the test specification and a contradiction has been found.

Starting the diagnosis process, assume the invoke activity B communicating with RA is abnormal. After starting the analyser, the initial snapshot E0 contains the variables of FI: E0(message), E0(accept) and E0(amount) are uninitialised, E0(Aactive)=+, and E0(Aabort)=−. Also, no messages have been sent or received to and from any peer: E0(seeni)=0 for i ∈ {RA, LA, Cl}.

A is assumed normal. The initial client request is received and, because the message history is the same as for the test execution, the values must be the same. Therefore, E1=E0[amount←100]. The (normal) transition conditions are evaluated and it is determined that E1(Bactive)=+.

As there is no other active activity, B is selected for execution. As B is assumed abnormal, neither the communication events nor the variable assignments need to occur as in the original execution. We obtain two possible successors; one for the normal completion/fail stop case:

$$E_2 = E_1[X \leftarrow \top,\ B_{done} \leftarrow \top,\ B_{exc} \leftarrow -,\ seen_i \leftarrow \top,\ seen_{Cl} \leftarrow [1, 4]],$$

and one for the exception case:

$$E'_2 = E_1[X \leftarrow \top,\ B_{done} \leftarrow -,\ B_{exc} \leftarrow +,\ seen_i \leftarrow \top,\ seen_{Cl} \leftarrow 1],$$

with X = {message, accept, amount} and i ∈ {RA, LA}.

On analysing the exception case E′2, it is determined that because the scope does not contain a fault handler, it is aborted. This case contradicts the expected behaviour and E′2 is eliminated.

Therefore we continue with E2. It is determined that because the value of E2(risk) is not known, both paths are feasible, implying Bdone=⊤, Cactive=⊤ and Dactive=⊤.

Because the Loan Assessor is not to be used, i.e. MSLA=∅, activity C must find a different partner. As RA and LA offer incompatible interfaces, RA cannot be a target. Assume that a pair of client ports suits the interface description: seenCl=[1, 3] (the first message in the client sequence was consumed by the initial receive and the next message must be a send operation). Continuing the propagation, D is executed and assigns the value “yes” to accept.

Because both C and D could possibly be active at the same time, the result must be combined into a single result E3 and the preceding process repeated until a fixpoint is reached. This leads to an environment where the values of risk and accept are unknown and the seeni variables have value [1, 4], and activity E is activated.

As E is not abnormal, the message matches either the second or the last client message, resulting in E4(seenCl)=[2, 4] and the nested scope being active. As the only active operation at this time is F, we must find a message in the sequence that matches a receive operation. Here, only the third message is valid and we obtain seenCl=3.

After the forward analysis has completed, a backward pass is done and seenCl=3 is propagated backwards to E, which at this point must match the second client message. Propagating this further up to C and D, it is derived that there is no valid message left to choose for C; thus the path is infeasible and is removed. Continuing propagation with D and subsequently B, we derive that the abnormal B must account for the two messages exchanged with RA. As the communication history for the two messages is the same as in the failing test run, the same result must be derived and we obtain that RA::risk=low. Again, this is a contradiction with the test specification and, therefore, the abnormal B cannot explain the behaviour of the test run.

Continuing with the remaining components, we obtain four single fault diagnoses: either I or J may be faulty in that they should issue a message to RA invoking the compensation handler, or F or H should abort in addition to performing their normal actions.

4 Related work

Our work complements the work in [Ardissono et al., 2005] which concentrates on the definition of decentralised interactions between individual diagnosers for each Web service in a cooperative orchestration, but does not examine the individual diagnosis engines or their communication strategies in detail. An earlier approach dealt with generic component-based software systems while considering the individual components as black boxes [Grosclaude, 2004]. We focus on the actual orchestration mechanisms, using BPEL4WS as an exemplary choreography language.

Considerable effort has been spent on the use of semantic service descriptions for more or less automated composition of services, e.g., [Narayanan and McIlraith, 2002; Pistore et al., 2005], but so far, detailed error handling has only been addressed at the level of fault handler implementation [Chafle et al., 2005], or, at best, verification [Kazhamiakin and Pistore, 2005].

5 Conclusion

In this paper we consider the task of diagnosing cooperative business processes specified in BPEL4WS, building on and extending earlier work on imperative languages [Mayer and Stumptner, 2002; 2004], while incorporating BPEL-specific concerns such as concurrency and nondeterminism. We focus on the development of the top level process itself, assuming that pre-existing remote services accessed in the collaboration are less likely to be the source of the fault than their specific usage and the choreography itself (although the individual component services are amenable to being examined in a hierarchic diagnosis process). This work only scratches the surface, with many aspects not yet considered, such as timeouts and timed alarms, deeper analysis of data structures (XML trees that are accessed using XPath – this would be subject to earlier work on debugging in functional languages), and the vast space of possibilities opened up by the incorporation of Semantic Web service specifications in OWL-S or other formalisms. These could be incorporated in the diagnosis process in similar fashion to pre- and postconditions in Java debugging [Stumptner, 2005]. Ultimately, reconfiguration and planning could be incorporated to effect Web Service repairs.

References

[Alonso et al., 2004] Gustavo Alonso, Fabio Casati, Harumi Kuno, and Vijay Machiraju. Web Services. Springer-Verlag, 2004.

[Ardissono et al., 2005] L. Ardissono, L. Console, A. Goy, G. Petrone, C. Picardi, M. Segnan, and D. Theseider Dupre. Cooperative model-based diagnosis of web services. In Proceedings of the Sixteenth International Workshop on Principles of Diagnosis, Monterey, June 2005.

[Bourdoncle, 1993] Francois Bourdoncle. Abstract debugging of higher-order imperative languages. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, pages 46–55, 1993.

[Chafle et al., 2005] Girish Chafle, Sunil Chandra, Pankaj Kankar, and Vijay Mann. Handling faults in decentralized orchestration of composite web services. In Proc. ICSOC, pages 410–423, Amsterdam, 2005. Springer-Verlag.

[Christensen et al., 2001] Erik Christensen, Francisco Curbera, Greg Meredith, and Sanjiva Weerawarana. Web Services Description Language (WSDL) 1.1. Technical report, World Wide Web Consortium, March 2001.

[Corbett, 1998] James C. Corbett. Using shape analysis to reduce finite-state models of concurrent Java programs. Technical report, Department of Information and Computer Science, University of Hawaii, 1998.

[Cousot and Cousot, 1977] Patrick Cousot and Radhia Cousot. Abstract interpretation: A unified lattice model for static analysis of programs by construction of approximation of fixpoints. In POPL’77, pages 238–252, Los Angeles, 1977.

[Curbera et al., 2003] Francisco Curbera, Yaron Goland, Johannes Klein, Frank Leymann, Dieter Roller, Satish Thatte, and Sanjiva Weerawarana. Business Process Execution Language for Web Services 1.1. Technical report, IBM and Microsoft, May 2003.

[Darwiche, 1999] Adnan Darwiche. Compiling knowledge into decomposable negation normal form. In Proc. 16th IJCAI, pages 284–289, 1999.

[Frohlich and Nejdl, 1997] Peter Frohlich and Wolfgang Nejdl. A Static Model-Based Engine for Model-Based Reasoning. In Proceedings 15th International Joint Conf. on Artificial Intelligence, Nagoya, Japan, August 1997.

[Grosclaude, 2004] Irene Grosclaude. Model-based monitoring of component-based software systems. In Proceedings of the Fifteenth International Workshop on Principles of Diagnosis, Carcassonne, June 2004.

[Jones et al., 2002] James A. Jones, Mary Jean Harrold, and John Stasko. Visualization of test information to assist fault localization. In Proceedings of the 24th International Conference on Software Engineering, Zurich, Switzerland, September 2002.

[Kazhamiakin and Pistore, 2005] Raman Kazhamiakin and Marco Pistore. A parametric communication model for the verification of bpel4ws compositions. In EPEW/WS-FM, LNCS, pages 318–332, Versailles, 2005. Springer-Verlag.

[Mayer and Stumptner, 2002] Wolfgang Mayer and Markus Stumptner. Modeling programs with unstructured control flow for debugging. In Proc. 15th Australian Joint Conf. on AI, pages 107–118, Canberra, December 2002. Springer-Verlag.

[Mayer and Stumptner, 2004] Wolfgang Mayer and Markus Stumptner. Debugging program loops using approximate modeling. In Proc. ECAI, Zaragoza, August 2004.

[Narayanan and McIlraith, 2002] Srini Narayanan and Sheila A. McIlraith. Simulation, verification and automated composition of web services. In Proceedings of the 11th International Conference on the World Wide Web, pages 77–88. ACM Press, 2002.

[owl, 2004] OWL-S, 2004. http://www.daml.org/service/owl-s/.

[Pistore et al., 2005] Marco Pistore, Paolo Traverso, Piergiorgio Bertoli, and A. Marconi. Automated synthesis of composite bpel4ws web services. In ICWS, pages 293–301, Orlando, 2005.

[Stumptner and Wotawa, 2000] Markus Stumptner and Franz Wotawa. Using Model-Based Reasoning for Locating Faults in VHDL Designs. Kunstliche Intelligenz, 14(4):62–67, 2000.

[Stumptner, 2001] Markus Stumptner. Using design information to identify structural software faults. In Proc. 14th Australian Joint Conf. on AI, Springer LNAI 2256, pages 473–486, Adelaide, December 2001.

[Stumptner, 2005] Markus Stumptner. Web service composition. In 4th International Conference on Information Systems Technology and its Applications (ISTA’05), Palmerston North, NZ, May 2005. GI - Gesellschaft fuer Informatik.


Abstract

In fault diagnosis, the integration between the model-based fault detection and isolation modules plays a significant role. Model-based fault detection methods have inherent problems, such as the lack of fault indication persistence, noise sensitivity and model errors, which cause fault detection not to be as good as required. Consequently, the fault isolation module may be confused when the associated residual has to be evaluated along with a set of residuals. In the present work, interval observers are used in the fault detection module. Then, the effect of the observer gain on the fault indication persistence is recalled and extended to the fault isolation task. A case study based on real data taken from the limnimeters of Barcelona’s urban sewer system is used to illustrate this effect.

Keywords: Fault detection, Fault isolation, Robustness, Intervals, Observers.

1 Introduction

In recent years, the integration between the fault detection and fault isolation tasks in model-based fault diagnosis has been a very active research area (see, among others, [Combastel et al., 2003], [Pulido et al., 2005] or [Puig et al., 2005]). The typical binary interface between these two modules has been improved using additional information. One example is the algorithm presented in [Pulido et al., 2005], where the following aspects are taken into account: residual sign, fault residual sensitivity and fault order. The core of this fault isolation algorithm for the non-uncertain system case was proposed in [Puig et al., 2005].

However, model-based fault detection is still the Achilles' heel, since noise sensitivity, model errors and the lack of fault indication persistence may cause fault detection not to be as good as needed. Consequently, the fault isolation module may be confused, especially when the associated residual set is not diagonal with respect to the faults, so that a subset of the residuals must be active at the same time instant in order to isolate a given fault.

When using observer-based fault detection methods, the observer gain plays an important role because it determines the time evolution of the residuals, their sensitivity to a fault and therefore the minimum detectable fault at any time instant [Chen and Patton, 1999]. Thus, the fault indication persistence depends on the observer gain [Meseguer et al., 2006] and, according to this observer gain effect, a fault can be permanently detected, non-permanently detected or non-detected [Meseguer et al., 2006].

The fault detection problems mentioned above and their influence on the fault isolation module were already noticed by [Combastel et al., 2003], who suggest registering the maximum residual value once it is reached. However, this strategy introduces an additional problem, since it is then not possible to know when the fault disappears. Another approach would be to use structured residuals in diagonal form [Gertler, 1998], but this might be too complicated in this case because of the parameter uncertainty. When using an interval observer method, the effect of those fault detection problems might be partially avoided by properly designing the observer gain matrix and, therefore, the fault isolation result might also be improved.

The goal of this work is to show how different fault isolation results are obtained depending on the observer gain for a given fault scenario, in spite of using an accurate fault isolation algorithm: right persistent fault isolation, right non-persistent fault isolation, wrong fault isolation and lack of fault isolation. The interval observer-based fault diagnosis algorithm will be applied to the limnimeters of Barcelona's urban sewer system.

The remainder of the paper is organised as follows. Passive robust fault detection using interval observers is presented in Section 2. In Section 3, the integration of robust interval observer methods with the fault isolation algorithm is discussed. In Section 4, for a given limnimeter fault scenario, several observer gain sets are used in order to show their influence on the resulting fault isolation. Finally, in Section 5, the main conclusions are presented.

Observer Gain Effect in Linear Interval Observer-Based Fault Isolation

Jordi Meseguer, Vicenç Puig, Teresa Escobet, Joseba Quevedo Automatic Control Department - Campus de Terrassa

Universidad Politécnica de Cataluña (UPC) Rambla Sant Nebridi, 10. 08222 Terrassa (Spain)

[email protected]


2 Passive robust based fault detection using interval observers

2.1 Input/output interval observer expression

Considering that the system to be monitored can be described by a MIMO linear uncertain dynamic model in discrete time, its input-output relationship, without faults, disturbances and noise, is

$$\mathbf{y}(k) = \mathbf{M}(q^{-1}, \boldsymbol{\theta})\,\mathbf{u}(k) \tag{1}$$

where $\mathbf{M}(q^{-1}, \boldsymbol{\theta})$ is the transfer function matrix expressed using the shift operator $q^{-1}$ and $\boldsymbol{\theta} \in \Theta$ is a vector of interval-bounded parameters representing the model uncertainty: $\Theta = \{\boldsymbol{\theta} \in \Re^{p} \mid \underline{\boldsymbol{\theta}} \le \boldsymbol{\theta} \le \overline{\boldsymbol{\theta}}\}$. This type of model is known as an interval model.

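Instead of directly using the system model given by (1) to detect faults, the following observer form will be used:

$$\begin{aligned} \hat{\mathbf{x}}(k+1) &= \bigl(\mathbf{A}(\boldsymbol{\theta}) - \mathbf{W}\mathbf{C}(\boldsymbol{\theta})\bigr)\,\hat{\mathbf{x}}(k) + \mathbf{B}(\boldsymbol{\theta})\,\mathbf{u}(k) + \mathbf{W}\,\mathbf{y}(k) \\ \hat{\mathbf{y}}(k) &= \mathbf{C}(\boldsymbol{\theta})\,\hat{\mathbf{x}}(k) \end{aligned} \tag{2}$$

where the matrices $\mathbf{A}(\boldsymbol{\theta})$, $\mathbf{B}(\boldsymbol{\theta})$ and $\mathbf{C}(\boldsymbol{\theta})$ are obtained from the system model (1) using its observer canonical state-space form and $\mathbf{W}$ is the observer gain, designed to stabilize the matrix $\mathbf{A}_0 = \mathbf{A}(\boldsymbol{\theta}) - \mathbf{W}\mathbf{C}(\boldsymbol{\theta})$ and to guarantee a desired fault detection performance for all $\boldsymbol{\theta} \in \Theta$ while avoiding the wrapping effect.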

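The effect of the uncertain parameters $\boldsymbol{\theta}$ on the observer temporal response will be bounded using an interval $[\underline{\hat{\mathbf{y}}}(k), \overline{\hat{\mathbf{y}}}(k)]$, where:

$$\underline{\hat{y}}_i(k) = \min_{\boldsymbol{\theta} \in \Theta} \hat{y}_i(k, \boldsymbol{\theta}), \qquad \overline{\hat{y}}_i(k) = \max_{\boldsymbol{\theta} \in \Theta} \hat{y}_i(k, \boldsymbol{\theta})$$

Such an interval can be computed using the algorithm presented in [Puig et al., 2003].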
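As an illustration of what the envelope $[\underline{\hat{y}}(k), \overline{\hat{y}}(k)]$ represents, the following sketch propagates a scalar (first-order, C = 1) instance of observer (2) for a grid of parameter values and takes the pointwise minimum and maximum. Gridding is only an assumed, brute-force approximation chosen for illustration (it gives an inner approximation of the true interval); it is not the algorithm of [Puig et al., 2003]. The function names and numeric values are hypothetical.

```python
from itertools import product


def observer_output(a, b, w, u, y, x0=0.0):
    """Scalar instance of observer (2) with C = 1:
    x(k+1) = (a - w) x(k) + b u(k) + w y(k),  y_hat(k) = x(k)."""
    x = x0
    y_hat = []
    for uk, yk in zip(u, y):
        y_hat.append(x)
        x = (a - w) * x + b * uk + w * yk
    return y_hat


def grid(lo, hi, n):
    """n equally spaced values covering [lo, hi]."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def output_envelope(a_int, b_int, w, u, y, n=11):
    """Pointwise min/max of y_hat(k, theta) over a grid of theta = (a, b)."""
    runs = [observer_output(a, b, w, u, y)
            for a, b in product(grid(*a_int, n), grid(*b_int, n))]
    lower = [min(col) for col in zip(*runs)]
    upper = [max(col) for col in zip(*runs)]
    return lower, upper
```

For fault-free data generated by a model whose parameters lie inside the intervals, the measured output should stay inside the computed envelope, which is the situation described by the fault-free condition on the system outputs.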

The observer given by (2) can also be described by the following input-output relationship for each output:

$$\hat{y}_i(k) = \sum_{v=1}^{n_{y_i}} a_{iv}\,\hat{y}_i(k-v) + \sum_{j=1}^{n_u}\sum_{v=1}^{n_{u_{ij}}} b_{ijv}\,u_j(k-v) + \sum_{v=1}^{n_{y_i}} w_{iv}\,\bigl(y_i(k-v) - \hat{y}_i(k-v)\bigr), \qquad \boldsymbol{\theta} = \bigl(a_{i1}, \ldots, a_{i n_{y_i}}, b_{ij1}, \ldots, b_{ij n_{u_{ij}}}\bigr) \in \Theta \tag{3}$$

where $y_i(k)$ is a given measured system output and $u_j(k)$ is a given measured system input. Moreover, $n_{y_i}$ determines the model order associated with the system output 'i', $n_u$ is the number of system inputs and $n_{u_{ij}}$ is the number of past values of the input 'j' needed for the output 'i'.

When there is no fault, each of the system outputs satisfies:

$$y_i(k) \in [\underline{\hat{y}}_i(k), \overline{\hat{y}}_i(k)] \tag{4}$$

Equivalently, this observer can be expressed in transfer function form using the shift operator $q^{-1}$ and assuming zero initial conditions as:

$$\hat{y}_i(k) = \sum_{j=1}^{n_u} \frac{G_{ij}(q^{-1})}{H_i(q^{-1}) + W_i(q^{-1})}\,u_j(k) + \frac{W_i(q^{-1})}{H_i(q^{-1}) + W_i(q^{-1})}\,y_i(k) \tag{5}$$

where:

$$G_{ij}(q^{-1}) = \sum_{v=1}^{n_{u_{ij}}} b_{ijv}\,q^{-v}, \qquad H_i(q^{-1}) = 1 - \sum_{v=1}^{n_{y_i}} a_{iv}\,q^{-v}, \qquad W_i(q^{-1}) = \sum_{v=1}^{n_{y_i}} w_{iv}\,q^{-v}$$

When all the observer gains $w_{iv}$ are zero, the observer is in fact a simulator [Chow et al., 1984], but if the condition $w_{iv} = a_{iv}$ is fulfilled, the observer becomes a predictor [Chow et al., 1984].
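The simulator and predictor limits can be checked numerically with a direct implementation of recursion (3). The sketch below is a minimal single-input, single-output version under zero initial conditions; the function name `io_observer` and the coefficient values used with it are hypothetical.

```python
def io_observer(a, b, w, u, y):
    """Input-output observer (3) for one output and one input.

    a, b, w hold the coefficients a_v, b_v, w_v for v = 1..n;
    u and y are the measured input and output sequences.
    """
    n = max(len(a), len(b), len(w))
    y_hat = [0.0] * len(y)  # zero initial conditions
    for k in range(len(y)):
        acc = 0.0
        for v in range(1, n + 1):
            if k - v < 0:
                continue
            if v <= len(a):
                acc += a[v - 1] * y_hat[k - v]
            if v <= len(b):
                acc += b[v - 1] * u[k - v]
            if v <= len(w):
                acc += w[v - 1] * (y[k - v] - y_hat[k - v])
        y_hat[k] = acc
    return y_hat
```

With `w = [0.0]` the estimate does not depend on the measured output at all (simulation), whereas with `w = a` the past estimates cancel and the estimate becomes `a*y(k-1) + b*u(k-1)` (one-step prediction).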

2.2 Fault detection using an interval observer

Fault detection is based on generating a residual that compares the measurements of the physical variables $y_i(k)$ of the process with their estimates $\hat{y}_i(k)$ provided by the associated system observer:

$$r_i(k) = y_i(k) - \hat{y}_i(k) \tag{6}$$

This residual can also be expressed in transfer function form as follows:

$$r_i(k) = \frac{H_i(q^{-1})}{H_i(q^{-1}) + W_i(q^{-1})}\,y_i(k) - \sum_{j=1}^{n_u} \frac{G_{ij}(q^{-1})}{H_i(q^{-1}) + W_i(q^{-1})}\,u_j(k) \tag{7}$$

leading to its computational form [Gertler, 1998]. In the FDI interface used here, residual (6) is computed with respect to the nominal observer output $\hat{y}_i^o(k)$, obtained using $\boldsymbol{\theta} = \boldsymbol{\theta}^o \in \Theta$:

$$r_i^o(k) = y_i(k) - \hat{y}_i^o(k) \tag{8}$$

Under ideal conditions, $r_i^o(k)$ should be zero at each time instant k. However, when model uncertainty located in the parameters is considered, the residual generated by (8) will not be zero even in a non-faulty scenario. In this case, the possible values of this residual can be bounded using an interval [Puig et al., 2002], and a fault is indicated when

$$r_i^o(k) \notin [\underline{r}_i^o(k), \overline{r}_i^o(k)] \tag{9}$$

where $\underline{r}_i^o(k) = \underline{\hat{y}}_i(k) - \hat{y}_i^o(k)$ and $\overline{r}_i^o(k) = \overline{\hat{y}}_i(k) - \hat{y}_i^o(k)$. The interval for the residual constitutes an adaptive threshold.
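A minimal sketch of the resulting detection test, assuming that the output envelope and the nominal estimate are already available at the current time instant (the function names are hypothetical):

```python
def nominal_residual(y, y_hat_nom):
    """Nominal residual (8)."""
    return y - y_hat_nom


def adaptive_threshold(y_hat_low, y_hat_nom, y_hat_up):
    """Residual interval used in (9), built from the output envelope."""
    return y_hat_low - y_hat_nom, y_hat_up - y_hat_nom


def fault_indicated(y, y_hat_low, y_hat_nom, y_hat_up):
    """True when the nominal residual lies outside its adaptive threshold."""
    r = nominal_residual(y, y_hat_nom)
    r_low, r_up = adaptive_threshold(y_hat_low, y_hat_nom, y_hat_up)
    return not (r_low <= r <= r_up)
```

Since the nominal estimate is subtracted on both sides, this test is equivalent to checking whether the measurement leaves the output interval of condition (4).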

2.3 Degree of residual violation

As proposed in [Puig et al., 2005], the activation value for each residual is calculated as in the DMP approach [Petti et al., 1990] using the Kramer function:

$$\phi_i(k) = \begin{cases} \dfrac{\bigl(r_i^o(k)/\overline{r}_i^o(k)\bigr)^4}{1 + \bigl(r_i^o(k)/\overline{r}_i^o(k)\bigr)^4} & \text{if } r_i^o(k) \ge 0 \\[2ex] -\dfrac{\bigl(r_i^o(k)/\underline{r}_i^o(k)\bigr)^4}{1 + \bigl(r_i^o(k)/\underline{r}_i^o(k)\bigr)^4} & \text{if } r_i^o(k) < 0 \end{cases} \tag{10}$$

In this way, residuals are normalized to a metric between -1 and 1, $\phi_i(k) \in [-1, 1]$, which indicates the degree to which each equation is satisfied: 0 for perfectly satisfied, 1 for severely violated high and -1 for severely violated low.
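The activation value (10) can be sketched as follows, assuming an adaptive threshold with a negative lower bound and a positive upper bound (the function name is hypothetical). A residual sitting exactly on its threshold gives an absolute activation of 0.5, which is the activation level used later by the isolation algorithm.

```python
def kramer_activation(r, r_low, r_up):
    """Activation value (10) of a nominal residual r with adaptive
    threshold [r_low, r_up], assuming r_low < 0 < r_up."""
    if r >= 0:
        ratio4 = (r / r_up) ** 4
        return ratio4 / (1.0 + ratio4)
    ratio4 = (r / r_low) ** 4
    return -ratio4 / (1.0 + ratio4)
```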

2.4 Fault residual sensitivity concept

The sensitivity of the residual [Gertler, 1998] to a fault is given by

$$S_{F_{ij}}(q^{-1}) = \frac{\partial r_i}{\partial f_j} = G_{F_{ij}}(q^{-1}) \tag{11}$$

where $G_{F_{ij}}(q^{-1})$ is the transfer function that describes the effect on the residual $r_i$ of a given fault $f_j$. In the following, Eq. (11) is particularized to an additive output/input sensor fault using a fault sensitivity analysis. This Section focuses on the effect of the observer gain on the time evolution of the fault residual sensitivity, since the rest of the FD observer properties depend on it [Meseguer et al., 2006].

The expression of the residual sensitivity to an additive output sensor fault obtained using Eq. (11) and Eq. (7) is

$$S_{f_{y_i}}(q^{-1}) = \frac{\partial r_i}{\partial f_{y_i}} = \frac{H_i(q^{-1})}{H_i(q^{-1}) + W_i(q^{-1})} \tag{12}$$

This expression shows that the residual sensitivity to an additive output sensor fault is a function of time and how its dynamics and steady-state gain are influenced by the observer gain. Its value at time instant k=0, i.e., when the fault appears, is

$$s_{f_{y_i}}(0) = 1 \tag{13}$$

independently of the observer gain. On the other hand, the steady-state value for an abrupt fault modeled as a unit-step function is given by [Gertler, 1998]

$$s_{f_{y_i}}(\infty) = \frac{h_i}{h_i + \sum_{v=1}^{n_{y_i}} w_{iv}} \tag{14}$$

where $h_i = 1 - \sum_{v=1}^{n_{y_i}} a_{iv}$.

When all observer gains are zero (simulation case), the sensitivity value is 1, independently of the time instant k. However, when the observer gains satisfy $w_{iv} = a_{iv}$ (prediction case), the steady-state sensitivity is $h_i$. Besides, if the model is stable and isotonic [Puig et al., 2003], so that

$$1 > \sum_{v=1}^{n_{y_i}} a_{iv} > 0 \tag{15}$$

holds, then $s_{f_{y_i}}(\infty) = h_i < 1$.

In general, Eqs. (12) and (14) show that the steady-state value of the residual sensitivity decreases as the observer gain increases. Regarding the sensitivity of the residual to an additive input sensor fault, the same analysis could be carried out and similar results would be obtained regarding its dependence on the observer gain.
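The dependence of the sensitivity on the observer gain can be reproduced numerically. The sketch below evaluates the steady-state value (14) and the unit-step response of the sensitivity (12), written as the difference equation $s(k) = \sum_v (a_{v} - w_{v})\,s(k-v) + f(k) - \sum_v a_{v}\,f(k-v)$ with $f$ a unit step starting at k = 0. The function names and the first-order coefficient used in the checks are hypothetical, and both coefficient lists are assumed to have the same length.

```python
def steady_state_sensitivity(a, w):
    """Steady-state sensitivity (14) to a unit-step output sensor fault."""
    h = 1.0 - sum(a)
    return h / (h + sum(w))


def sensitivity_step_response(a, w, n_steps):
    """Unit-step response of S = H / (H + W) from (12).

    a and w hold a_v and w_v for v = 1..n (same length assumed).
    """
    n = len(a)
    s = []
    for k in range(n_steps):
        acc = 1.0  # f(k) = 1 for k >= 0
        for v in range(1, n + 1):
            if k - v >= 0:
                # (a_v - w_v) s(k-v) - a_v f(k-v), with f(k-v) = 1
                acc += (a[v - 1] - w[v - 1]) * s[k - v] - a[v - 1]
        s.append(acc)
    return s
```

For a first-order model with a = 0.6, the response stays at 1 for w = 0 (simulator), drops to h = 0.4 after one step for w = a (predictor), and settles at an intermediate value for gains in between, starting from 1 at k = 0 in every case.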

2.5 Observer gain influence on fault detection

The residual expression given by Eq. (7) shows the influence of the observer gain on the residual dynamics and on its steady-state value. On the other hand, Eq. (9) sets the condition that a nominal residual must fulfil in order to indicate a fault and, consequently, this condition is fully affected by the observer gain used.

Moreover, according to Section 2.4, the fault residual sensitivity is deeply affected by the observer gain. At the time instant at which the fault appears, simulators, observers and predictors have the same fault sensitivity. From that time instant on, simulators keep this good fault detection property, whereas the sensitivity of observers and predictors worsens, the predictor being the most deeply affected. This means that observers and predictors lose their ability to indicate faults once the fault has occurred. On the other hand, simulators have other serious problems, such as their high sensitivity to initial conditions and to model errors, and consequently this approach is not very suitable for fault detection. Thus, although the fault might not be persistently indicated, the model-based fault detection approach must use observers or predictors, the former being the approach that offers more fault indication persistence according to the above.

3 Fault isolation algorithm

3.1 Objective

In this Section, the fault isolation algorithm integrated with the interval observer-based fault detection approach presented previously is introduced. This algorithm is based on the one proposed in [Puig et al., 2005], but a modification is considered in order to show clearly the effect of the observer gain on the fault isolation task. In the original algorithm, the first component between the fault detection and fault isolation modules is an interface based on a memory that stores, along a time window of length Tw and for each residual, the time instant ($k_\phi$) at which the residual was activated according to (10) ($|\phi_i(k_\phi)| \ge 0.5$) and the activation value ($\phi_{i\max}$) whose absolute value is maximum. Then, at the end of the considered time window, an isolation result is given based on the information stored in the memory and on a pattern comparison component that is introduced in the following Section. This fault isolation algorithm needs the time window in order to avoid the fault detection problems mentioned above, although this opens a new problem: how long should this time window be? This approach is known as relative fault isolation. In the version of the algorithm considered in this paper, there is no memory component and, consequently, a fault isolation result is given at every time instant: absolute fault isolation. As mentioned above, this assumption is made in order to show the effect of the observer gain on the fault isolation result and how these gains might avoid the fault detection problems.

The efficiency of the absolute approach relies basically on a proper design of the observer gain matrix in order to avoid the fault detection problems mentioned above. On the other hand, the efficiency of the relative approach based on predictors relies on determining an optimal time window Tw, since in this case residual activation lasts only a few time instants, as already mentioned. Once a residual has been activated because of the effect of a fault, the time window Tw must be long enough for all the residuals affected by this fault to remain activated for at least one time instant; otherwise a wrong isolation result could be given. In consequence, Tw depends on the dynamics of the fault residual sensitivity.

3.2 Pattern comparison component

While at least one of the residuals is activated (10) ($|\phi_i(k)| \ge 0.5$), the pattern comparison component compares, at every time instant, the residual activation values given by Eq. (10) with the stored fault patterns. Given a set of residuals $r_i^o$ and the set of possible faults $F = \{f_1, f_2, \ldots, f_j, \ldots, f_m\}$, each $r_i^o$ is affected by a subset of the faults $f_j \in F$. The fault patterns are organized according to a theoretical fault signature matrix, named FSM. An element $\mathrm{FSM}_{ij}$ of the matrix contains the pattern if $f_j$ is expected to affect $r_i^o$; otherwise it is equal to 0. Four different fault signature matrices are considered in the evaluation task: Boolean fault signal activation (FSM01), fault signal signs (FSMsign), fault residual sensitivity (FSMsensit) and, finally, fault signal occurrence order (FSMorder).

3.2.1 FSM01: Evaluation of fault signal appearance

The FSM01 table contains the theoretical binary patterns that faults produce in the residual equations. Those patterns can be codified using the values 0 for no influence and 1 otherwise. $\mathrm{factor01}_j$ is calculated for the jth fault hypothesis in the following way:

$$\mathrm{factor01}_j(k) = \frac{\sum_{i=1}^{n} \mathrm{boolean}(\phi_i(k))\,\mathrm{FSM01}_{ij}}{\sum_{i=1}^{n} \mathrm{FSM01}_{ij}}\;\mathrm{zvf}_j \tag{16}$$

with

$$\mathrm{boolean}(\phi_i(k)) = \begin{cases} 0, & \text{if } \phi_i(k) = 0 \\ 1, & \text{if } \phi_i(k) \ne 0 \end{cases} \tag{17}$$

and the zero-violation-factor as

$$\mathrm{zvf}_j = \begin{cases} 0, & \text{if } \exists\, i \in \{1, \ldots, n\} \text{ with } \mathrm{FSM01}_{ij} = 0 \text{ and } \phi_i(k) \ne 0 \\ 1, & \text{otherwise} \end{cases} \tag{18}$$

This leads to the following behaviour: expected fault signals support a fault hypothesis, whereas hypotheses contradicted by unexpected fault signals are eliminated through the zero-violation-factor.
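The evaluation of (16)-(18) for one fault hypothesis can be sketched as below. The fault signature matrix is stored row-wise (one row per residual, one column per fault). Setting activation values below the 0.5 activation level to zero before the comparison is an assumption made here for illustration, as are all function names; each fault is also assumed to affect at least one residual, so that the denominator of (16) is not zero.

```python
def suppress_inactive(phi, level=0.5):
    """Assumed preprocessing: residuals with |phi_i| < level count as not activated."""
    return [p if abs(p) >= level else 0.0 for p in phi]


def boolean(phi_i):
    """Equation (17)."""
    return 0 if phi_i == 0 else 1


def zero_violation_factor(phi, column):
    """Equation (18): 0 if a residual the fault should not affect is active."""
    for phi_i, entry in zip(phi, column):
        if entry == 0 and phi_i != 0:
            return 0
    return 1


def factor01(phi, fsm01, j):
    """Equation (16) for the j-th fault hypothesis (column j of FSM01)."""
    column = [row[j] for row in fsm01]
    hits = sum(boolean(p) * e for p, e in zip(phi, column))
    return hits / sum(column) * zero_violation_factor(phi, column)
```

For example, with a hypothetical matrix `[[1, 0], [1, 1], [0, 1]]` and activations `[0.8, -0.7, 0.2]`, the first hypothesis is fully supported, while the second one is discarded because the first residual is active although that fault should not affect it.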

3.2.2 FSMsign: Evaluation of fault signal signs

The FSMsign table contains the theoretical sign patterns that faults produce in the residual equations. Those patterns can be codified using the values 0 for no influence and +1/-1 for positive/negative deviation for every $\mathrm{FSMsign}_{ij}$.

The $\mathrm{factorsign}_j$ is calculated by comparing the theoretical signs with the signs of the newly activated residuals:

$$\mathrm{numsign}_j(k) = \max\Bigl(\sum_{i=1}^{n} \mathrm{checksign}\bigl(\phi_i(k), \mathrm{FSMsign}_{ij}\bigr),\; \sum_{i=1}^{n} \mathrm{checksign}\bigl(\phi_i(k), -\mathrm{FSMsign}_{ij}\bigr)\Bigr) \tag{19}$$

with

$$\mathrm{checksign}\bigl(\phi_i(k), \mathrm{FSMsign}_{ij}\bigr) = \begin{cases} 0, & \mathrm{sign}(\phi_i(k)) \ne \mathrm{sign}(\mathrm{FSMsign}_{ij}) \\ 1, & \mathrm{sign}(\phi_i(k)) = \mathrm{sign}(\mathrm{FSMsign}_{ij}) \end{cases} \tag{20}$$

Then:

$$\mathrm{factorsign}_j(k) = \frac{\mathrm{numsign}_j(k)}{\sum_{i=1}^{n} |\mathrm{FSMsign}_{ij}|}\;\mathrm{zvfsign}_j \tag{21}$$

where the factor $\mathrm{zvfsign}_j$ is defined in a similar way as in the case of factor01, excluding those fault hypotheses that have a zero in a position where the fault signal presents a sign.

3.2.3 FSMsensit: Evaluation of fault sensitivities

This evaluation component uses the residual activation values $\phi_i(k)$ and computes factorsensit using the sensitivity-based FSMsensit table for weighting those activation values. That approach can be found as well in the DMP method [Petti et al., 1990]. The following equation describes how to calculate the entries $\mathrm{FSMsensit}_{ij}$:

$$\mathrm{FSMsensit}_{ij} = \begin{cases} \dfrac{S_{F_{ij}}}{\overline{r}_i^o(k)} & \text{if } r_i^o(k) \ge 0 \\[2ex] \dfrac{S_{F_{ij}}}{\underline{r}_i^o(k)} & \text{if } r_i^o(k) < 0 \end{cases} \tag{22}$$

where $S_{F_{ij}}$ is the sensitivity of the nominal residual $r_i^o(k)$ to the fault hypothesis $f_j$, calculated using Eq. (11).

Although the fault residual sensitivity depends on time in the case of a dynamic system, here the steady-state value after a fault occurrence is considered, as was also suggested in [Gertler, 1998]. The value of $\mathrm{FSMsensit}_{ij}$ describes how easily a fault will cause the ith residual to violate its associated adaptive threshold, since the larger the partial derivative of the residual with respect to the fault, the more sensitive that equation is to deviations from the assumption.

Similarly, residuals with large detection thresholds are less sensitive, as they are more difficult to violate. Therefore $\mathrm{FSMsensit}_{ij}$ can be used to weight the activation values of the different fault signals:

$$\mathrm{factorsensit}_j(k) = \frac{\sum_{i=1}^{n} \phi_i(k)\,\mathrm{FSMsensit}_{ij}}{\sum_{i=1}^{n} \mathrm{FSMsensit}_{ij}}\;\mathrm{zvfsensit}_j \tag{23}$$

where the factor $\mathrm{zvfsensit}_j$ is defined similarly to the case of factor01, excluding those fault hypotheses that have a zero theoretical sensitivity while the fault signal presents a non-zero value.
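The weighting in (22) and (23) can be sketched as follows for a single fault hypothesis. The code follows the two equations literally; the function names and the numeric values used in the example are hypothetical, and the sum of the column entries is assumed to be non-zero.

```python
def fsm_sensit_entry(s_ij, r, r_low, r_up):
    """Entry (22): sensitivity scaled by the threshold bound on the side of r."""
    return s_ij / r_up if r >= 0 else s_ij / r_low


def factorsensit(phi, sensit_column):
    """Factor (23) for one fault hypothesis (one column of FSMsensit)."""
    zvf = 0 if any(e == 0 and p != 0 for p, e in zip(phi, sensit_column)) else 1
    weighted = sum(p * e for p, e in zip(phi, sensit_column))
    return weighted / sum(sensit_column) * zvf


# Hypothetical example: two residuals affected by the fault, one unaffected.
column = [
    fsm_sensit_entry(0.6, 3.0, -1.0, 1.0),   # narrow threshold, weight 0.6
    fsm_sensit_entry(1.0, 4.0, -2.0, 2.0),   # wide threshold, weight 0.5
    0.0,                                      # fault does not affect residual 3
]
value = factorsensit([0.9, 0.8, 0.0], column)
```

In this example the first residual receives the larger weight although its sensitivity is smaller, because its detection threshold is narrower and therefore easier to violate.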

3.2.4 FSMorder: Fault signal occurrence order evaluation

In dynamic systems, the symptom of a fault $f_j$ does not appear at the same time in all residuals. The FSMorder table contains the order of appearance of the symptoms for each fault hypothesis; this order is codified using ordinal numbers, starting with '1'. If two fault signals appear at the same time, or if they may explicitly commute their order, then they should share the same ordinal number. Fault signals that must not appear get the code '0'. factororder is calculated by comparing the order of appearance of the fault signals with the theoretical order:

$$\mathrm{factororder}_j(k) = \frac{\sum_{i=1}^{n} \mathrm{checkorder}\bigl(\phi_i(k), \mathrm{FSMorder}_{ij}\bigr)}{\sum_{i=1}^{n} \mathrm{boolean}(\mathrm{FSMorder}_{ij})}\;\mathrm{zvforder}_j \tag{24}$$

where

$$\mathrm{checkorder}\bigl(\phi_i(k), \mathrm{FSMorder}_{ij}\bigr) = \begin{cases} 0, & \mathrm{order}(\phi_i(k)) \ne \mathrm{FSMorder}_{ij} \\ 1, & \mathrm{order}(\phi_i(k)) = \mathrm{FSMorder}_{ij} \end{cases} \tag{25}$$

and $\mathrm{order}(\phi_i(k))$ is the order in which the ith fault signal ($\phi_i(k)$) has been activated with respect to the first activated one.
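One possible way of obtaining the observed order used in (25) from the first activation instants of the residuals is sketched below. The conventions are assumptions made for illustration: residuals activated at the same instant share an ordinal number, ordinals are consecutive (dense ranking), and a residual that has not been activated is coded as 0. The function names are hypothetical.

```python
def apparition_order(activation_times):
    """Map first-activation instants to ordinal numbers starting at 1.

    activation_times[i] is the instant at which residual i was first
    activated, or None if it has not been activated (coded as 0).
    """
    times = sorted({t for t in activation_times if t is not None})
    rank = {t: position + 1 for position, t in enumerate(times)}
    return [rank[t] if t is not None else 0 for t in activation_times]


def checkorder(observed_order, theoretical_order):
    """Equation (25) for one residual and one fault hypothesis."""
    return 1 if observed_order == theoretical_order else 0
```

For instance, activation instants `[150, None, 150, 153]` give the observed orders `[1, 0, 1, 2]`, which can then be compared entry by entry with a column of FSMorder.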

Decision logic component

Once the pattern comparison component has evaluated the pattern factors explained above at a given time instant k, the decision logic component gives a fault isolation decision based on those evaluation factors. The decision logic takes into account the most probable fault for each factor, based on the number of coincidences between the observed pattern and the theoretical one stored in the corresponding FSM matrix. The result gives 4 measures of the confidence in each fault hypothesis.

3.3 Observer gain influence on fault isolation

In fault isolation, a subset of residuals usually needs to be active at the same time in order to isolate the fault. In this situation, the lack of fault indication persistence of a residual or of its associated activation value (10), together with the fact that the residuals in the subset have different fault sensitivities, may confuse the isolation module and, in consequence, different fault isolation results may appear: right persistent fault isolation, right non-persistent fault isolation, wrong fault isolation and lack of fault isolation.

According to Section 2, the fault indication persistence given by a residual or its associated activation value (10) is influenced by the observer gain. The result of the fault isolation algorithm presented in this Section is based on evaluating, at every time instant, several factors (factor01, factorsign, factorsensit, factororder) computed using the activation values (10) of the residuals associated with each fault hypothesis. Consequently, since these activation values are affected by the observer gain, the result of the fault isolation algorithm is also affected by it.

4 Application example

4.1 Application description

The proposed fault diagnosis algorithm has been applied to 13 limnimeters of the Barcelona urban drainage system. In this Section, a given limnimeter fault scenario is analyzed in order to show how different fault isolation results can be obtained depending on the observer gains used. In particular, the following cases can appear: right persistent fault isolation, right non-persistent fault isolation, wrong fault isolation and lack of fault isolation.

Limnimeters can be monitored using an on-line rainfall-runoff model of the sewerage network. One possible methodology to derive a real-time model of this kind is through a simplified graph relating the main sewers and a set of virtual and real reservoirs [Cembrano et al., 2002]. A virtual reservoir is an aggregation of a catchment of the sewer network which approximates the hydraulics of rain, runoff and sewage water retention. Its hydraulics is given by:

$$\frac{dV(t)}{dt} = Q_{up}(t) - Q_{down}(t) + I(t)\,S \tag{26}$$

where V is the volume of water accumulated in the catchment, $Q_{up}$ and $Q_{down}$ are the flows entering and exiting the catchment, I is the rain intensity falling on the catchment and S its surface. Input and output sewer levels are measured using limnimeters and can be related to flows using a linearised Manning relation: $Q_{up}(t) = M_{up} L_{up}(t)$ and $Q_{down}(t) = M_{down} L_{down}(t)$. Moreover, it is assumed that $Q_{down}(t) = K_v V(t)$. Then, substituting in Eq. (26) and discretising:

$$L_{down}(k+1) = a\,L_{down}(k) + b\,L_{up}(k) + c\,I(k) \tag{27}$$

where $a = (1 - K_v \Delta t)$, $b = M_{up} K_v \Delta t / M_{down}$ and $c = S K_v / M_{down}$. Using this modelling methodology, a model of the selected part of Barcelona's sewer network is presented in Fig. 1. Its structure depends on the topology of the network and its parameters must be estimated using real data from the sensors in the network.


Fig. 1. Virtual reservoir model of the Barcelona prototype network (legend: gates Cx, rain inputs Ix, virtual tanks Vx, real tank, CSO, weir-type connections between catchments, water level sensors Lxx)

This model and the measurements provided by 5 rain-gauges and 13 limnimeters allow deriving 12 residuals. Once the structure of the models for each limnimeter has been selected, the interval for the parameters will be determined. Such interval model would be calibrated in order to guarantee that the predicted behaviour interval includes all the non-modelled effects. An algorithm inspired on the one proposed by [Ploix et al., 1999] will be used to provide the system parameter nominal estimation. Then, using optimization tools, the uncertainty parameter intervals of the considered reduced observer are adjusted using a worst-case approach until all the measured data is covered by the interval of prediction for the considered observer gain.

4.2 Fault scenarios

The proposed interval observer-based fault diagnosis approach has been tested using a faulty scenario affecting sensor L39, in which its output is zero-valued from time instant k=150. According to the binary fault signature matrix FSM01, this faulty scenario only has an influence on the residuals associated with limnimeters L39 and L41 and thus both must be activated in order to isolate that fault. The reduced observers associated with L39 and L41 are given by

$$\hat{L}_{39}(k+1) = a_{39}(1 - w_{39})\,\hat{L}_{39}(k) + c_{39}\,I_{16}(k) + a_{39} w_{39}\,L_{39}(k) \tag{28}$$

where $w_{39}$ ($w_{39} = 0$, simulation; $w_{39} = 1$, prediction) is the associated observer gain, $I_{16}$ is the rain intensity measured by the rain gauge P16 and $a_{39} \in [0.496, 0.744]$, $c_{39} \in [0.601, 0.901]$. These interval parameter values are valid for the observer gains tested in this paper.

$$\hat{L}_{41}(k+1) = a_{41}(1 - w_{41})\,\hat{L}_{41}(k) + b_{41}\,L_{39}(k) + c_{41}\,I_{16}(k) + a_{41} w_{41}\,L_{41}(k) \tag{29}$$

where $w_{41}$ is the associated observer gain and $a_{41} \in [0.869, 1.063]$, $b_{41} \in [-0.296, -0.245]$, $c_{41} \in [0.734, 0.897]$. These interval parameter values are valid for the observer gains tested in this paper.

First, the observer gains are set so that a right persistent fault isolation result is obtained. Then, new observer gain values are given so that the fault is almost non-detected. Finally, new values are assigned so that the fault isolation result is non-permanent and partially wrong.
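The effect of the gain on the L39 residual can be reproduced qualitatively with synthetic data. The sketch below runs a nominal instance of observer (28) for the two gains $w_{39} = 0.35$ and $w_{39} = 0.85$ used in the cases that follow. Everything else is an assumption made for illustration: the nominal parameters are taken as the midpoints of the intervals given above, the rain intensity is constant, the sensor is noise-free and its output is forced to zero from k = 150. Adaptive thresholds are not computed, so the sketch only shows the residual magnitude, not the detection result.

```python
def reduced_observer(a, c, w, rain, level_meas, level0=0.0):
    """Nominal instance of (28):
    L_hat(k+1) = a (1 - w) L_hat(k) + c I(k) + a w L(k)."""
    est = [level0]
    for k in range(len(level_meas) - 1):
        est.append(a * (1.0 - w) * est[k] + c * rain[k] + a * w * level_meas[k])
    return est


# Hypothetical data: constant rain, sensor stuck at zero from k = 150.
A39, C39 = 0.62, 0.751          # interval midpoints, assumed as nominal values
N, K_FAULT = 250, 150
rain = [10.0] * N
true_level = [0.0]
for k in range(N - 1):
    true_level.append(A39 * true_level[k] + C39 * rain[k])
measured = [0.0 if k >= K_FAULT else true_level[k] for k in range(N)]


def residual_for_gain(w):
    """Nominal residual (8) of the L39 observer for a given gain."""
    est = reduced_observer(A39, C39, w, rain, measured)
    return [m - e for m, e in zip(measured, est)]


r_low_gain = residual_for_gain(0.35)
r_high_gain = residual_for_gain(0.85)
```

Under these assumptions both residuals jump by the full fault size at k = 150, in agreement with (13), and then decay towards different steady-state values: the residual obtained with the lower gain keeps the larger magnitude, which is the mechanism behind the more persistent fault indication.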

a) Persistent fault isolation case

When the observer gains are w39=0.35 and w41=0.4, the fault isolation algorithm indicates L39 as the faulty sensor from the time the fault appears. The observer gains have been chosen so that no residual is activated in a non-faulty scenario and their values are as low as possible so that the fault does not fully contaminate the observer model. Thus, once the fault has occurred, the residual violation factor (10) associated with L39 and L41 is activated while the fault lasts.

In Figure 2, the time evolution of the fault detection test (9) associated with both limnimeters is drawn. This Figure shows how the limnimeter nominal residuals (8) are kept out of their associated adaptive thresholds from the time the fault appears (k=150) and thus both of them indicate that fault persistently.

In Figure 3, the absolute value of the residual violation factors (10) associated with both limnimeters is drawn. In order to avoid the undesired effect of noise, the diagnosis algorithm does not use the instantaneous value of factor (10) but an average of its last values over a given time window, and this is what is plotted in Figure 3. These factors indicate the fault while their absolute value is bigger than 0.5, which occurs a few time instants after the fault has appeared (k=150), and they remain activated till the end of the scenario. Consequently, as Figure 3 shows, the fault is persistently detected by the interval observers associated with both limnimeters.

In Figure 4, the time evolution of the fault isolation result is plotted: factor01 (16), factorsign (21), factorsensit (23) and factororder (24) associated with each fault hypothesis. In this fault scenario and for the observer gains used, only the L39 indicators are activated and they remain so persistently from the time the fault appears till the end of the scenario. Consequently, the fault is clearly isolated by the interval observers for the considered observer gains.

Fig. 2. Time evolution of the residuals and their adaptive thresholds (panels: L39 and L41 residual & adaptive threshold time evolution; horizontal axis: Time)

Fig. 3. Evolution of the residual violation degree absolute value (panels: L39 and L41 residual violation degree time evolution; horizontal axis: Time)

Fig. 4. The factor indicators time evolution (panels: factor01, factorsign, factorsensit and factororder time evolution, each labelled L39; horizontal axis: Time)

b) Almost non-isolated fault case

In this case, using the same fault scenario, the isolation algorithm can only isolate the fault for very few time instants because the fault detection does not last longer. This is because the interval observers associated with both limnimeters use high observer gain values (w39=0.85 and w41=0.85) and, consequently, their behavior is quite close to that of predictors: the values predicted by the model are almost fully contaminated by the fault a few time instants after the fault has occurred.

In Figure 5, the time evolution of the fault detection test (9) is plotted, showing that the nominal residuals (8) are kept inside the adaptive threshold for most time instants once the fault has occurred. Consequently, their residual violation factors (10), whose absolute values are plotted in Figure 6, are activated for very few time instants only. As a result, the fault isolation indicators (factor01 (16), factorsign (21), factorsensit (23) and factororder (24)) associated with the L39 fault hypothesis are hardly activated and, therefore, the fault cannot be isolated. The time evolution of these indicators is plotted in Figure 7.

Fig. 5. Evolution of the residuals and their adaptive thresholds

Fig. 6. Evolution of the residual violation degree absolute value

Fig. 7. The factor indicators time evolution (each panel labelled L39)

c) Non-persistent and partially wrong fault isolation case

In this case, using the same fault scenario, the fault is clearly isolated from its occurrence, but only for a time window, because the L39 residual violation factor (10) does not indicate the fault permanently. Then, the isolation algorithm decreases the values associated with the L39 isolation indicators and, on the other hand, it activates the indicators associated with the L41 fault hypothesis. This could lead to a wrong isolation result from that time instant on. The fault isolation behavior described above is obtained when the L39 model uses an observer gain quite similar to the one used in case b) while the gain of the L41 model is quite similar to the one used in case a) (w39=0.7 and w41=0.5).

In Figure 8, the time evolution of the fault detection test (9) is plotted for both limnimeters, while Figure 9 shows the time evolution of the absolute value of the corresponding residual violation factor (10). Both Figures are in line with the behavior described above.

In Figure 10, the time evolution of the fault isolation indicators (factor01 (16), factorsign (21), factorsensit (23) and factororder (24)) associated with L39 and L41 is plotted. This Figure shows how the fault isolation algorithm is confused between the L39 and L41 fault hypotheses once the L39 observer model no longer indicates the fault. In spite of this, the factor factorsensit (23) associated with L39 continues to have a bigger value than the one corresponding to L41 and, consequently, this fault hypothesis might still be the best candidate.


Fig. 8. Evolution of the residuals and their adaptive thresholds

Fig. 9. Evolution of the residual violation degree absolute value

Fig. 10. The factor indicators time evolution (panel labels: L39, L41, L39 & L41)


5 Conclusions

In general, model-based fault detection methods have inherent problems which cause fault detection not to be as good as needed and, therefore, the fault isolation module may be confused. In particular, this paper shows that fault isolation results may be very sensitive to the fault indication persistence provided by the fault detection module. It also shows that this lack of persistence can deteriorate the integration between the model-based fault detection and fault isolation modules. When using interval observers, the fault indication persistence might be improved by properly designing the observer gain matrix. Therefore, the fault isolation results might also be improved. As further work, the influence of the observer gain on the fault isolation module should be studied more quantitatively using different fault types. Thus, the comparison between absolute and relative fault isolation will be discussed in more precise terms by analyzing the relation between Tw and the observer gain. Moreover, a proper design of the observer gain matrix should be studied in order to avoid fault detection problems and thereby enhance the fault isolation results.

Acknowledgments

The authors wish to thank the support received from the Research Commission of the Generalitat of Catalunya (Grup SAC ref. 2005SGR00537) and from CICYT (ref. DPI-2005-05415) of the Spanish Ministry of Education.

References

[Cembrano et al., 2002] Cembrano, G. “Global Control of the Barcelona Sewerage System for Environment Protection”. In Proceedings of IFAC World Congress, Barcelona, 2002.

[Chen and Patton, 1999] Chen, J. and R.J. Patton. “Robust Model-Based Fault Diagnosis for Dynamic Systems”. Kluwer Academic Publishers.

[Chow et al., 1984] Chow, E., Willsky, A. “Analytical redundancy and the design of robust failure detection systems”. IEEE Transactions on Automatic Control, Volume 29, Issue 7, Jul 1984, Pages 603–614.

[Combastel et al., 2003] Combastel, C., S. Gentil, and J. P. Rognon. “Toward a better integration of residual generation and diagnostic decision”. In Proceedings of IFAC Safeprocess’03, Washington, USA, 2003.

[Gertler, 1998] Gertler, J. Fault Detection and Diagnosis in Engineering Systems. M. Dekker, 1998.

[Pulido et al., 2005] B. Pulido, V. Puig, T. Escobet, and J. Quevedo. “A new fault localization algorithm that improves the integration between fault detection and localization in dynamic systems”. 16th International Workshop on Principles of Diagnosis (DX 05). Monterey, California, USA, June 1-3, 2005.

[Petti et al., 1990] Petti, T.F., J. Klein, and P. S. Dhurjati. “Diagnostic model processor: Using deep knowledge for process fault diagnosis”. AIChE Journal, vol. 36, p. 565.

[Ploix et al., 1999] Ploix, S., Adrot, O. and J. Ragot. “Parameter Uncertainty Computation in Static Linear Models”. 38th IEEE Conference on Decision and Control. Phoenix, Arizona, USA.

[Meseguer et al., 2006] Meseguer, J., Puig, V., Escobet, T. “Observer gain effect in linear observer-based fault detection”. IFAC SAFEPROCESS’06.

[Puig et al., 2002] Puig, V., Quevedo, J., Escobet, T., De las Heras, S. “Robust Fault Detection Approaches using Interval Models”. IFAC World Congress (b’02). Barcelona, Spain.

[Puig et al., 2003] Puig, V., Saludes, J., Quevedo, J. “Worst-Case Simulation of Discrete Linear Time-Invariant Dynamic Systems”. Reliable Computing 9(4): 251-290, August.

[Puig et al., 2005] Puig, V., J. Quevedo, T. Escobet, and B. Pulido (2005). “A New Fault Diagnosis Algorithm that Improves the Integration of Fault Detection and Isolation”. In ECC-CDC’05, Sevilla, Spain.


A Generalization of the GDE Minimal Hitting-Set Algorithm to Handle Behavioral Modes

Mattias Nyberg

Department of Electrical Engineering, Linkoping University,

SE-581 83 Linkoping, Sweden

Phone: +46-13285714, Fax: +46-13282035,

Email: [email protected]

Abstract

A generalization of the minimal hitting-set algorithm given by deKleer and Williams is presented. The original algorithm handles only one faulty mode per component and only positive conflicts. In contrast, the new algorithm presented here handles more than two modes per component and also non-positive conflicts. The algorithm computes a logical formula that characterizes all diagnoses. Instead of minimal diagnoses, or kernel diagnoses, some specific conjunctions in the logical formula are used to characterize the diagnoses. These conjunctions are a generalization of both minimal and kernel diagnoses. From the logical formulas, it is also easy to derive the set of preferred diagnoses.

1 Introduction

Within the field of fault diagnosis, it has often been assumed that each component has only two possible behavioral modes, e.g. see [Reiter, 1987; deKleer and Williams, 1987]. For this case, and given a set of conflict sets, it is well known that a minimal hitting set corresponds to a minimal diagnosis [Reiter, 1987].¹ Algorithms for computing all minimal hitting sets have been presented in [Reiter, 1987; deKleer and Williams, 1987]. Improvements have later been given in e.g. [Greiner et al., 1989; Wotawa, 2001].

In [Reiter, 1987; deKleer and Williams, 1987] it is assumed that a conflict can only imply that some component is faulty. We call this a positive conflict [deKleer et al., 1992]. If all conflicts are positive, it is also well known that the set of all minimal diagnoses characterizes all diagnoses [deKleer and Williams, 1987]. This will for example be the case if the faulty modes of the components have no fault models. However, if there are fault models, it is possible to have non-positive conflicts implying that some component is fault-free.

If there is a desire to compute something that characterizes all diagnoses when there are non-positive conflicts, the concept of minimal hitting sets and the algorithms in [Reiter, 1987; deKleer and Williams, 1987] cannot be used. To solve this, an alternative characterization based on so-called kernel diagnoses was proposed in [deKleer et al., 1992], where an algorithm to compute the kernel diagnoses was also given. The kernel diagnoses characterize all diagnoses even in the case of non-positive conflicts.

¹ Reiter used the word diagnosis for what in this paper is called minimal diagnosis.

It has been noted in several papers that more than two possible behavioral modes are useful for improving the performance of the diagnostic system, see e.g. [Struss and Dressler, 1989; deKleer and Williams, 1989]. For this case, neither minimal diagnoses nor kernel diagnoses can be used to characterize all diagnoses. Further, none of the algorithms in [Reiter, 1987; deKleer and Williams, 1987; deKleer et al., 1992] are applicable.

To be able to handle both more than two behavioral modes and non-positive conflicts, the present paper proposes a new characterization of all diagnoses. Conflicts and diagnoses are represented by logical formulas, and instead of minimal diagnoses and kernel diagnoses, we use more general conjunctions of a specific form. In the special case of two behavioral modes per component, these conjunctions become equivalent to kernel diagnoses, and in the case of only positive conflicts, they become equivalent to minimal diagnoses. Thus, the framework proposed here can be seen as a generalization of both minimal diagnoses and kernel diagnoses.

Another contribution is that we show that the minimal hitting set algorithm given in [deKleer and Williams, 1987] can in fact be generalized to compute the characterization proposed here. Note that, even though the papers [Struss and Dressler, 1989; deKleer and Williams, 1989] consider more than two behavioral modes per component, they are, in contrast to the present paper, not concerned with the characterization or computation of all diagnoses.

Under the assumption of only two behavioral modes per component, the minimal diagnoses can be argued to be the most desired diagnoses. This has been called the parsimony principle, e.g. see [Reiter, 1987]. In the generalized case of more than two behavioral modes, the minimal diagnoses are no longer necessarily the most desired diagnoses. Instead, the concept of preferred diagnoses has been defined in [Dressler and Struss, 1992]. We will in this paper show how to obtain these preferred diagnoses by means of the above-mentioned logical formulas.


The paper is organized as follows. In Section 2, the algorithm from [deKleer and Williams, 1987] is restated as a reference. In Section 3, the logical framework is presented. Then the generalized version of the algorithm from [deKleer and Williams, 1987] is given in Section 4. Sections 5 and 6 discuss the relation to minimal and kernel diagnoses. Finally, Section 7 describes how to compute the preferred diagnoses. All proofs of theorems have been placed in an appendix.

2 The Original Algorithm

This section presents the original algorithm and its associated framework as presented in [deKleer and Williams, 1987]. However, since we have a different objective than in the original paper, we will not always use the same notation and naming convention.

The system to be diagnosed is assumed to consist of a number of components represented by a set 𝒞. A conflict is represented as a set C ⊆ 𝒞. The meaning of a conflict C is that not all components in C can be in the normal fault-free mode. Thus only positive conflicts can be handled. A conflict C_1 is said to be minimal if there is no other conflict C_2 such that C_2 ⊂ C_1.

A diagnosis δ is also represented as a set δ ⊆ 𝒞. The meaning of a diagnosis δ is that the components contained in δ are faulty and the components not contained in δ are fault free. A diagnosis δ_1 is said to be minimal if there is no other diagnosis δ_2 such that δ_2 ⊂ δ_1.

One fundamental relation between conflicts and diagnoses is that if ℂ is the set of all minimal conflicts, δ is a diagnosis if and only if for all conflicts C ∈ ℂ it holds that δ ∩ C ≠ ∅.

Given a set of diagnoses Δ and a conflict C, the minimal hitting set algorithm in [deKleer and Williams, 1987] finds an updated set of minimal diagnoses. A version of the algorithm, as described in the text of [deKleer and Williams, 1987], can be written as follows.

Algorithm 1
Input: a set of minimal diagnoses Δ, and a conflict set C
Output: the updated set of minimal diagnoses Θ
Δ_old = Δ
forall δ_i ∈ Δ do
    if δ_i ∩ C = ∅ then
        Remove δ_i from Δ_old
        forall c ∈ C do
            δ_new := δ_i ∪ {c}
            forall δ_k ∈ Δ, δ_k ≠ δ_i do
                if δ_k ⊆ δ_new then goto LABEL1
            end
            Δ_add := Δ_add ∪ {δ_new}
            LABEL1
        end
    end
end
Θ := Δ_old ∪ Δ_add

The algorithm has the property that if Δ is the set of all minimal diagnoses, the algorithm output Θ will contain all minimal diagnoses with respect to the new conflict C as well. Further, Θ will contain only minimal diagnoses. Note that this algorithm does not require the conflict C to be minimal, contrary to what has been stated in [Greiner et al., 1989]. It can also be noted that the loop over δ_k ∈ Δ could be modified to δ_k ∈ Δ_old, which would be more efficient since Δ_old is smaller than Δ.
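As an illustration only, Algorithm 1 can be sketched in Python with diagnoses and conflicts as sets of component names. The function name `update_minimal_diagnoses` and the small usage example are assumptions made for this sketch and do not come from [deKleer and Williams, 1987]; `Δ_add` is initialized to the empty set, which the pseudocode leaves implicit.

```python
from typing import FrozenSet, Iterable, Set

Diagnosis = FrozenSet[str]


def update_minimal_diagnoses(diagnoses: Iterable[Diagnosis],
                             conflict: Iterable[str]) -> Set[Diagnosis]:
    """One pass of Algorithm 1: update the minimal diagnoses with a new conflict."""
    delta = set(diagnoses)
    conflict_set = frozenset(conflict)
    delta_old = set(delta)
    delta_add: Set[Diagnosis] = set()  # assumed to start empty
    for d_i in delta:
        if d_i & conflict_set:
            continue  # d_i already hits the conflict and stays in delta_old
        delta_old.discard(d_i)
        for c in conflict_set:
            d_new = d_i | {c}
            # "goto LABEL1": skip d_new if another old diagnosis is a subset of it
            if any(d_k <= d_new for d_k in delta if d_k != d_i):
                continue
            delta_add.add(d_new)
    return delta_old | delta_add


# Hypothetical usage: start from the empty diagnosis and add two conflicts.
step1 = update_minimal_diagnoses([frozenset()], {"a", "b"})
step2 = update_minimal_diagnoses(step1, {"b", "c"})
```

With the conflicts {a, b} and {b, c}, the second call leaves the two minimal hitting sets {b} and {a, c}.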

3 A Logical Framework

Each component is assumed to be in exactly one out of several behavioral modes. A behavioral mode can be for example no-fault, abbreviated NF, gain-fault G, bias B, open circuit OC, short circuit SC, unknown fault UF, or just faulty F. For our purposes, each component is abstracted to a variable specifying the behavioral mode of that component. Let 𝒞 denote the set of such variables. For each component variable c, let R_c denote the domain of possible behavioral modes, i.e. c ∈ R_c.

We will now define a set of formulas to be used to express that certain components are in certain behavioral modes. If c is a component variable in the set 𝒞 and M ⊆ R_c, the expression c ∈ M is a formula. For example, if p is a pressure sensor, the formula p ∈ {NF, G, UF} means that the pressure sensor is in mode NF, G, or UF. If M is a singleton, e.g. M = {NF}, we will sometimes also write p = NF. Further, the constant ⊥ with value false is a formula. If φ and γ are formulas, then φ ∧ γ, φ ∨ γ, and ¬φ are formulas.

In accordance with the theory of first-order logic, we say that a formula φ is a semantic consequence of another formula γ, and write γ ⊨ φ, if all assignments of the variables 𝒞 that make γ true also make φ true. This can be generalized to sets of formulas, i.e. {γ_1, …, γ_n} ⊨ {φ_1, …, φ_m} if and only if γ_1 ∧ ⋯ ∧ γ_n ⊨ φ_1 ∧ ⋯ ∧ φ_m. If it holds that Γ ⊨ Φ and Φ ⊨ Γ, where Φ and Γ are formulas or sets of formulas, Φ and Γ are said to be equivalent and we write Γ ≃ Φ.

We will devote special interest to conjunctions of the form

c_1 ∈ M_1 ∧ c_2 ∈ M_2 ∧ ⋯ ∧ c_n ∈ M_n    (1)

where all components are unique, i.e. c_i ≢ c_j if i ≠ j, and each M_i is a nonempty proper subset of R_{c_i}, i.e. ∅ ≠ M_i ⊂ R_{c_i}. Let D_i denote a conjunction of the form (1). From a set of such conjunctions we can then form a disjunction

D_1 ∨ D_2 ∨ ⋯ ∨ D_m    (2)

Note that the different conjunctions D_i can contain different numbers of components. We will say that a formula is in maximal normal form (MNF) if it is of the form (2) and has the additional property that no conjunction is a consequence of another conjunction, i.e. for each conjunction D_i there is no conjunction D_j, j ≠ i, for which it holds that D_j ⊨ D_i. The purpose of using formulas in MNF is that they are relatively compact, in the sense that an MNF formula does not contain redundant conjunctions and each conjunction does not contain redundant assignments.

For an example, consider the following two formulas containing pressure sensors p_1, p_2, and p_3, all of which have the behavioral modes R_{p_i} = {NF, G, B, UF}:

p_1 ∈ {UF} ∧ p_2 ∈ {B, UF} ∨ p_3 ∈ {UF}

p_1 ∈ {UF} ∧ p_2 ∈ {B, UF} ∨ p_1 ∈ {G, UF}

The first formula is in MNF but not the second, since p_1 ∈ {UF} ∧ p_2 ∈ {B, UF} ⊨ p_1 ∈ {G, UF}.
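The MNF condition can be checked mechanically. The following is a minimal sketch, assuming a conjunction of the form (1) is stored as a dictionary from component name to its mode set; the names `entails` and `is_mnf` are hypothetical. It only tests the "no conjunction is a consequence of another" property and assumes that every mode set is already a nonempty proper subset of its domain.

```python
from typing import Dict, FrozenSet, List

Conjunction = Dict[str, FrozenSet[str]]  # component name -> allowed modes M_i


def entails(d1: Conjunction, d2: Conjunction) -> bool:
    """d1 |= d2 for conjunctions of form (1).

    Every component constrained by d2 must also be constrained by d1,
    with a mode set that is contained in the one used by d2.
    """
    return all(c in d1 and d1[c] <= modes for c, modes in d2.items())


def is_mnf(formula: List[Conjunction]) -> bool:
    """True if no conjunction of the disjunction entails another one."""
    return not any(
        i != j and entails(d_j, d_i)
        for i, d_i in enumerate(formula)
        for j, d_j in enumerate(formula)
    )


# The two example formulas over the pressure sensors p1, p2, p3.
first = [{"p1": frozenset({"UF"}), "p2": frozenset({"B", "UF"})},
         {"p3": frozenset({"UF"})}]
second = [{"p1": frozenset({"UF"}), "p2": frozenset({"B", "UF"})},
          {"p1": frozenset({"G", "UF"})}]
```

Here `is_mnf(first)` holds, while `is_mnf(second)` fails because the first conjunction of `second` entails the second one.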


3.1 Conflicts and Diagnoses

A conflict is assumed to be written using the logical language defined above. For example, if it has been found that the pressure sensor p_1 cannot be in the mode NF at the same time as p_2 is in the mode B or NF, this gives the conflict

H = p_1 ∈ {NF} ∧ p_2 ∈ {B, NF}    (3)

To relate this definition of conflict to the one used in Section 2, consider the conflict C = {a, b, c}. With the logical language, we can write this conflict as a ∈ {NF} ∧ b ∈ {NF} ∧ c ∈ {NF}.

Instead of conflicts, we will mostly use negated conflicts, so instead of H we consider ¬H. In particular, we will use negated conflicts written in MNF. For example, the negated conflict ¬H, where H is defined as in (3), can be written in MNF as p_1 ∈ {G, B, UF} ∨ p_2 ∈ {G, UF}. Without loss of generality, we will from now on assume that all negated conflicts are written in the form

c_1 ∈ M_1 ∨ c_2 ∈ M_2 ∨ ⋯ ∨ c_n ∈ M_n    (4)

where c_j ≢ c_k if j ≠ k, and ∅ ≠ M_i ⊂ R_{c_i}. This means that (4) is in MNF.

A system behavioral mode is a conjunction containing a unique assignment of all components in 𝒞. For example, if 𝒞 = {p_1, p_2, p_3}, a system behavioral mode could be

p_1 = UF ∧ p_2 = B ∧ p_3 = NF

We consider the term diagnosis to refer to a system behavioral mode consistent with all negated conflicts. More formally, if ℙ is the set of all negated conflicts, a system behavioral mode d is a diagnosis if {d} ∪ ℙ ⊭ ⊥ or, equivalently, d ⊨ ℙ.

To relate this definition of diagnosis to the one used in Section 2, assume that 𝒞 = {a, b, c, d} and consider the diagnosis δ = {a, b}. With the logical language, we can write this diagnosis as a = F ∧ b = F ∧ c = NF ∧ d = NF.
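The two definitions of this section can be made concrete with a small sketch. It assumes conflicts and negated conflicts are dictionaries from component name to mode set (read as a conjunction and as a disjunction of form (4), respectively); `negate_conflict` and `is_diagnosis` are hypothetical helper names.

```python
from typing import Dict, FrozenSet, List

ModeSets = Dict[str, FrozenSet[str]]  # component name -> set of modes


def negate_conflict(conflict: ModeSets, domains: ModeSets) -> ModeSets:
    """Negate a conjunction (a conflict) into a disjunction of form (4).

    Each assignment "c in M" becomes "c in R_c minus M".
    """
    negated = {}
    for comp, modes in conflict.items():
        complement = domains[comp] - modes
        if complement:  # an empty complement can never hold and is dropped
            negated[comp] = complement
    return negated


def is_diagnosis(mode: Dict[str, str], negated_conflicts: List[ModeSets]) -> bool:
    """A system behavioral mode is a diagnosis if it satisfies every negated conflict."""
    return all(
        any(mode[comp] in modes for comp, modes in p.items())
        for p in negated_conflicts
    )


R = frozenset({"NF", "G", "B", "UF"})
DOMAINS = {"p1": R, "p2": R, "p3": R}
H = {"p1": frozenset({"NF"}), "p2": frozenset({"B", "NF"})}  # conflict (3)
NOT_H = negate_conflict(H, DOMAINS)
```

For the conflict (3), `NOT_H` is the negated conflict given in the text, and the system behavioral mode p_1 = UF ∧ p_2 = B ∧ p_3 = NF is a diagnosis with respect to it.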

4 The Generalized Algorithm

With only small modifications, the original algorithm stated in Section 2 can be made to work with logical MNF formulas instead of sets. The result is an algorithm that handles more than two behavioral modes per component and also non-positive conflicts. With the modification, the algorithm will take as inputs a formula D and a negated conflict P, both written in MNF. The purpose of the algorithm is then to derive a new formula Q in MNF such that Q ≃ D ∧ P.

The modifications are the following:

• Instead of using a set of minimal diagnoses Δ as input, use a formula D in MNF. Note that D is not restricted to be a disjunction of system behavioral modes, but instead can be a disjunction of conjunctions of the form (1).

• Instead of using a conflict set C as input, use a negated conflict P of the form (4).

• Instead of checking the condition δ_i ∩ C = ∅, check the condition D_i ⊭ P.

• Instead of the assignment δ_new := δ_i ∪ {c}, find a conjunction D_new in MNF such that D_new ≃ D_i ∧ P_j.

• Instead of checking the condition δ_k ⊆ δ_new, check the condition D_new ⊨ D_k.

In the algorithm we will use the notation D_i ∈ D to denote the fact that D_i is a conjunction in D. The algorithm can now be stated as follows:

Algorithm 2
Input: a formula D in MNF, and a negated conflict P
Output: Q
D_old = D
forall D_i ∈ D do
    if D_i ⊭ P then
        Remove D_i from D_old
        forall P_j ∈ P do
            Let D_new be a conjunction in MNF such that D_new ≃ D_i ∧ P_j
            forall D_k ∈ D, D_k ≠ D_i do
                if D_new ⊨ D_k then goto LABEL1
            end
            D_add := D_add ∨ D_new
            LABEL1
        end
    end
end
Q := D_old ∨ D_add

To keep the algorithm description “clean”, some operations have been written in a simplified form. More details are discussed in Section 4.2 below. Note that an improvement corresponding to the change of Δ to Δ_old in Algorithm 1 is not possible for the generalized algorithm.

The algorithm is assumed to be used in an iterative manner as follows. First, when only one conflict P_1 is considered, the diagnoses are already described by P_1. Thus, the algorithm is not needed. When a second conflict P_2 is considered, the algorithm is fed with D = P_1 and P = P_2, and produces the output Q such that Q ≃ P_1 ∧ P_2. Then, for each additional conflict P_n that is considered, the input D is the old output Q.

When the algorithm is used in this way, the following results can be guaranteed.

Theorem 1 Let ℙ be a set of negated conflicts that is not inconsistent, i.e. ℙ ⊭ ⊥, and let Q be the output from Algorithm 2 after processing all negated conflicts in ℙ. Then it holds that Q ≃ ℙ.

Theorem 2 The output Q from Algorithm 2 is in MNF.

The proofs for these results can be found in the appendix.

4.1 Example

To illustrate the algorithm, consider the following small example where 𝒞 = {p_1, p_2, p_3} and the domain of behavioral modes for each component is R_{p_i} = {NF, G, B, UF}:

D = D_1 ∨ D_2 = p_1 ∈ {G, B, UF} ∨ p_3 ∈ {G, UF}

P = P_1 ∨ P_2 = p_2 ∈ {B, UF} ∨ p_3 ∈ {G, B, UF}

First, the condition D_1 ⊭ P is fulfilled, which means that D_1 is removed from D_old and the inner loop of the algorithm is entered. There, a D_new is created such that D_new ≃ D_1 ∧ P_1 = p_1 ∈ {G, B, UF} ∧ p_2 ∈ {B, UF}. This D_new is then compared to D_2 in the condition D_new ⊨ D_2. The condition is not fulfilled, which means that D_new is added to D_add. Next, a D_new is created such that D_new ≃ D_1 ∧ P_2 = p_1 ∈ {G, B, UF} ∧ p_3 ∈ {G, B, UF}. Also this time the condition D_new ⊨ D_2 is not fulfilled, implying that D_new is added to D_add. Next, the conjunction D_2 is investigated, but since D_2 ⊨ P holds, D_2 is not removed from D_old and the inner loop is not entered. The algorithm output is finally formed as

Q := D_old ∨ D_add = D_2 ∨ (D_1 ∧ P_1 ∨ D_1 ∧ P_2)
   = p_3 ∈ {G, UF} ∨ p_1 ∈ {G, B, UF} ∧ p_2 ∈ {B, UF} ∨ p_1 ∈ {G, B, UF} ∧ p_3 ∈ {G, B, UF}

It can be verified that Q ≃ D ∧ P. Also, it can be seen that Q is in MNF.
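The equivalence Q ≃ D ∧ P in this example is small enough to confirm by brute force over all 4³ = 64 system behavioral modes. The sketch below hard-codes D, P, and Q from the example; the helper names are assumptions made for illustration.

```python
from itertools import product

MODES = ("NF", "G", "B", "UF")
COMPONENTS = ("p1", "p2", "p3")


def holds_conjunction(conj, assignment):
    """True if the full assignment satisfies every 'component in modes' constraint."""
    return all(assignment[c] in modes for c, modes in conj.items())


def holds_disjunction(formula, assignment):
    """True if the full assignment satisfies at least one conjunction."""
    return any(holds_conjunction(conj, assignment) for conj in formula)


GBU = {"G", "B", "UF"}
D = [{"p1": GBU}, {"p3": {"G", "UF"}}]
P = [{"p2": {"B", "UF"}}, {"p3": GBU}]
Q = [{"p3": {"G", "UF"}},
     {"p1": GBU, "p2": {"B", "UF"}},
     {"p1": GBU, "p3": GBU}]

# Enumerate every system behavioral mode and compare Q with D ∧ P.
assignments = [dict(zip(COMPONENTS, values)) for values in product(MODES, repeat=3)]
equivalent = all(
    holds_disjunction(Q, a) == (holds_disjunction(D, a) and holds_disjunction(P, a))
    for a in assignments
)
```

After running the block, `equivalent` is `True`, in agreement with the statement above.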

4.2 Algorithm Details

To implement the algorithm, some more details need to be known. The first is how to check the condition D_i ⊨ P. To illustrate this, consider an example where D_i contains components c_1, c_2, and c_3 and P contains components c_2, c_3, and c_4. Since D is in MNF and P is in the form (4), D_i and P will have the form

D_i = c_1 ∈ M^D_1 ∧ c_2 ∈ M^D_2 ∧ c_3 ∈ M^D_3    (5)

P = c_2 ∈ M^P_2 ∨ c_3 ∈ M^P_3 ∨ c_4 ∈ M^P_4    (6)

We realize that the condition D_i ⊨ P holds if and only if M^D_2 ⊆ M^P_2 or M^D_3 ⊆ M^P_3. Thus, this example shows that in general, D_i ⊨ P holds if and only if D_i and P contain at least one common component c_i where M^D_i ⊆ M^P_i.

The second detail is how to find an expression D_new in MNF such that D_new ≃ D_i ∧ P_j. To illustrate this, consider an example where D_i contains components c_1 and c_2, and P_j the component c_2. Since D is in MNF and P is in the form (4), D_i and P_j will have the form

D_i = c_1 ∈ M^D_1 ∧ c_2 ∈ M^D_2    (7a)

P_j = c_2 ∈ M^P_2    (7b)

Then D_new will be formed as D_new = c_1 ∈ M^D_1 ∧ c_2 ∈ M^D_2 ∩ M^P_2, which means that D_new ≃ D_i ∧ P_j. If it holds that M^D_2 ∩ M^P_2 ≠ ∅, D_new will be in MNF. Otherwise let D_new = ⊥. The check D_new ⊨ D_k will then immediately make the algorithm jump to LABEL1, meaning that D_new will not be added to D_add.

The third detail is how to check the condition D_new ⊨ D_k. To illustrate this, consider an example where D_new contains components c_1 and c_2, and D_k the components c_2 and c_3. Since D_new and D are both in MNF, D_new and D_k will have the form

D_new = c_1 ∈ M^n_1 ∧ c_2 ∈ M^n_2    (8a)

D_k = c_2 ∈ M^D_2 ∧ c_3 ∈ M^D_3    (8b)

Without changing their meanings, these expressions can be expanded so that they contain the same set of components:

D′_new = c_1 ∈ M^n_1 ∧ c_2 ∈ M^n_2 ∧ c_3 ∈ R_{c_3}    (9)

D′_k = c_1 ∈ R_{c_1} ∧ c_2 ∈ M^D_2 ∧ c_3 ∈ M^D_3    (10)

Now we see that the condition D_new ⊨ D_k holds if and only if M^n_1 ⊆ R_{c_1}, M^n_2 ⊆ M^D_2, and R_{c_3} ⊆ M^D_3. The first of these three conditions is always fulfilled and the third can never be fulfilled since, by definition of MNF, M^D_3 ⊂ R_{c_3}. Thus, this example shows that D_new ⊨ D_k holds if and only if (1) D_k contains only components that are also contained in D_new, and (2) for all components c_i contained in both D_new and D_k it holds that M^n_i ⊆ M^D_i.

The fourth detail to be considered is the expression D_add := D_add ∨ D_new. Since D_add is not assigned from the beginning, this expression is to be read as D_add := D_new when D_add is unassigned.

Finally, note that D_old or D_add may be unassigned or empty at some places in the algorithm. In that case, e.g. in Q := D_old ∨ D_add, the missing term can just be neglected.
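Putting the four details together, Algorithm 2 can be sketched in Python as follows. This is an illustrative transcription of the pseudocode, not the author's implementation: the dictionary representation and all function names are assumptions, `None` stands for ⊥, and an empty list stands for an unassigned D_add. It has only been exercised here on the example of Section 4.1.

```python
from typing import Dict, FrozenSet, List, Optional

Conj = Dict[str, FrozenSet[str]]         # conjunction of form (1): component -> modes
NegConflict = Dict[str, FrozenSet[str]]  # disjunction of form (4): component -> modes


def entails_negated_conflict(d_i: Conj, p: NegConflict) -> bool:
    """D_i |= P: some common component whose mode set in D_i is a subset of that in P."""
    return any(c in p and modes <= p[c] for c, modes in d_i.items())


def conjoin(d_i: Conj, comp: str, modes: FrozenSet[str]) -> Optional[Conj]:
    """D_new equivalent to D_i ∧ (comp in modes); None stands for ⊥."""
    d_new = dict(d_i)
    d_new[comp] = d_i[comp] & modes if comp in d_i else modes
    return d_new if d_new[comp] else None


def entails_conjunction(d_new: Conj, d_k: Conj) -> bool:
    """D_new |= D_k: D_k mentions only components of D_new, with superset mode sets."""
    return all(c in d_new and d_new[c] <= modes for c, modes in d_k.items())


def algorithm2(d: List[Conj], p: NegConflict) -> List[Conj]:
    d_old = list(d)
    d_add: List[Conj] = []
    for d_i in d:
        if entails_negated_conflict(d_i, p):
            continue
        d_old.remove(d_i)
        for comp, modes in p.items():  # each P_j is one assignment of P
            d_new = conjoin(d_i, comp, modes)
            if d_new is None:
                continue               # D_new = ⊥ is never added
            if any(entails_conjunction(d_new, d_k) for d_k in d if d_k is not d_i):
                continue               # "goto LABEL1"
            d_add.append(d_new)
    return d_old + d_add


def process_all(negated_conflicts: List[NegConflict]) -> List[Conj]:
    """Iterative use: D starts as the first negated conflict, read as an MNF formula."""
    first, *rest = negated_conflicts
    q = [{comp: modes} for comp, modes in first.items()]
    for p in rest:
        q = algorithm2(q, p)
    return q


GBU = frozenset({"G", "B", "UF"})
P1 = {"p1": GBU, "p3": frozenset({"G", "UF"})}  # plays the role of D in Section 4.1
P2 = {"p2": frozenset({"B", "UF"}), "p3": GBU}  # plays the role of P in Section 4.1
Q = process_all([P1, P2])
```

For these inputs, `Q` contains the three conjunctions derived by hand in Section 4.1, in the order D_2, D_1 ∧ P_1, D_1 ∧ P_2.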

5 Relation to Minimal Diagnoses

The concept of minimal diagnoses was originally proposed in [Reiter, 1987; deKleer and Williams, 1987] for systems where each component has only two possible behavioral modes, i.e. the normal fault-free mode and a faulty mode. Minimal diagnoses have two attractive properties. Firstly, they represent the “simplest” diagnoses and are therefore often desired when prioritizing among diagnoses. Secondly, in case there are only positive conflicts, the minimal diagnoses characterize the set of all diagnoses. These two properties will now be investigated for the generalized case of more than two modes per component and non-positive conflicts.

5.1 “Simplest” Property

For the case of more than two modes per component, the concept of preferred diagnoses was defined in [Dressler and Struss, 1992] as a generalization of minimal diagnoses. The basic idea is that the behavioral modes for each component are ordered in a partial order defining that some behavioral modes are more preferred than others. For example, NF is usually preferred over any other mode, and a simple electrical fault, such as a short circuit or an open circuit, may be preferred over other more complex behavioral modes. Further, an unknown fault UF may be the least preferred mode.

For a formal definition, let b^1_c ≥_c b^2_c denote the fact that for component c, the behavioral mode b^1_c is equally or more preferred than b^2_c. For each component, this relation forms a partial order on the behavioral modes. Further, these relations induce a partial order on the system behavioral modes. Let d_1 and d_2 be two system behavioral modes d_i = ⋀_{c∈𝒞} (c = b^i_c). Then we write d_1 ≥ d_2 if for all c ∈ 𝒞 it holds that b^1_c ≥_c b^2_c. A preferred diagnosis can then formally be defined as a diagnosis d such that there is no other diagnosis d′ where d′ > d. In Section 7 we will discuss how the preferred diagnoses can be obtained from an MNF formula representing all diagnoses. Note that in the case of only two modes, preferred diagnoses are exactly the minimal diagnoses.

Remark: One may ask what “preferred” or “simplest” diagnoses means. One possible formal justification is the following. Let P(d) denote the prior probability of the system behavioral mode d = ⋀_{c∈𝒞} (c = b_c). We assume that faults occur independently of each other, which means that P(d) = ∏_{c∈𝒞} P(c = b_c), where P(c = b_c) is the prior probability that component c is in behavioral mode b_c. If Q is a formula such that Q ≃ ℙ, it holds that P(d | ℙ) = P(d ∧ Q)/P(Q). This means that P(d | ℙ) = P(d)/P(Q) if d ⊨ ℙ, i.e. if d is a diagnosis, and P(d | ℙ) = 0 if d ⊭ ℙ, i.e. if d is not a diagnosis. For a given set ℙ, the term P(Q) is only a normalization constant, which means that to compare P(d | ℙ) for different diagnoses it is enough to consider the priors P(d). Knowing the exact value of a prior P(c = b_c) may be very difficult or even impossible. Therefore one may assume that for each component, the priors are unknown but at least partially ordered. Under this assumption, and given the set of negated conflicts, the preferred diagnoses are then the most probable ones.

5.2 Characterizing Property

Now we investigate how the characterizing property of minimal diagnoses can be generalized to the case of more than two modes and the presence of non-positive conflicts. In some special cases, the preferred diagnoses characterize all diagnoses with the help of the partial order ≥. That is, if d_1 is a diagnosis and d_2 < d_1, we know that d_2 is also a diagnosis. This is always true when there are only two modes per component and only positive conflicts, which in turn is guaranteed when there are no fault models. Note that it may also be true in a case with more than two modes, even in the presence of fault models. However, this does not hold generally.

In an MNF formula, the conjunctions have the property that they characterize all diagnoses. For example, consider the case when the components are 𝒞 = {a, b, c, d, e}, R = {NF, B, G, UF} for all components, and a ∈ {B, UF} ∧ b ∈ {G, UF} is one of the conjunctions in an MNF formula. By letting each diagnosis be represented as an ordered set corresponding to ⟨a, b, c, d, e⟩, this single conjunction characterizes the diagnoses

{B, UF} × {G, UF} × {NF, B, G, UF} × {NF, B, G, UF} × {NF, B, G, UF}

which is 256 diagnoses.

For another example, assume that each of the components 𝒞 = {a, b, c, d} has only two modes, i.e. R = {NF, F}. A conjunction a ∈ {F} ∧ b ∈ {F} would then characterize all diagnoses {F} × {F} × {NF, F} × {NF, F}. In Section 2 this conjunction would be represented by {a, b}. If all conflicts are positive, all conjunctions would be of this form, and there is a one-to-one correspondence between the conjunctions in an MNF formula and the minimal diagnoses in the original framework described in Section 2.

If there is a fault model for the mode F of a component a, the non-positive conflict a ∈ {F} may appear. Assume also that a conflict b = NF appears. This has the consequence that a formula in MNF, describing all diagnoses, may for example contain a conjunction a ∈ {NF} ∧ b ∈ {F}. This conjunction characterizes all diagnoses {NF} × {F} × {NF, F} × {NF, F}, and this is a so-called kernel diagnosis (see the next section). Note that this conjunction cannot be represented using sets as described in Section 2. Note also that there is one minimal diagnosis in this example, namely a = NF ∧ b = F ∧ c = NF ∧ d = NF, and this minimal diagnosis does not characterize all diagnoses.
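The sets of diagnoses characterized by a single conjunction in the examples above can be enumerated directly as a Cartesian product. The sketch below is illustrative; `characterized_diagnoses` is a hypothetical helper that fills in the full domain for every component not mentioned in the conjunction.

```python
from itertools import product


def characterized_diagnoses(conjunction, domains):
    """List all system behavioral modes consistent with one conjunction."""
    components = sorted(domains)
    # Constrained components use their mode set, all others their full domain.
    choices = [sorted(conjunction.get(c, domains[c])) for c in components]
    return [dict(zip(components, values)) for values in product(*choices)]


# Five components with four modes each, conjunction a ∈ {B, UF} ∧ b ∈ {G, UF}.
four_modes = {"NF", "B", "G", "UF"}
domains5 = {c: four_modes for c in "abcde"}
many = characterized_diagnoses({"a": {"B", "UF"}, "b": {"G", "UF"}}, domains5)

# Four two-mode components, conjunction a ∈ {NF} ∧ b ∈ {F}.
two_modes = {"NF", "F"}
domains4 = {c: two_modes for c in "abcd"}
kernel_like = characterized_diagnoses({"a": {"NF"}, "b": {"F"}}, domains4)
```

The first call yields 2 · 2 · 4 · 4 · 4 = 256 diagnoses. The second yields four diagnoses, one of which is the minimal diagnosis a = NF ∧ b = F ∧ c = NF ∧ d = NF.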

6 Relation to Kernel Diagnoses

The paper [deKleer et al., 1992] defines partial diagnosis and kernel diagnosis. This was done assuming only two modes per component. The purpose of kernel diagnoses is that the set of all kernel diagnoses characterizes all diagnoses even in the case when there are non-positive conflicts. As noted in [deKleer et al., 1992], a subset of the kernel diagnoses is sometimes sufficient to characterize all diagnoses.

In the context of this paper we can define a partial diagnosis as a conjunction d of mode assignments such that d ⊨ ℙ. Then, a kernel diagnosis is a partial diagnosis d such that there is no other partial diagnosis d′ where d ⊨ d′.

According to the following theorem, the output Q from Algorithm 2 is, in the two-mode case, a disjunction of kernel diagnoses.

Theorem 3 Let each component have only two possible behavioral modes, let ℙ be a set of negated conflicts, and let Q be the output from Algorithm 2 after processing all negated conflicts in ℙ. Then it holds that each conjunction of Q is a kernel diagnosis.

Note that the MNF property alone does not guarantee that all conjunctions are kernel diagnoses. This can be seen in the following formula, which is in MNF.

c_1 = N ∧ c_2 = N ∨ c_1 = N ∧ c_2 = F    (11)

All diagnoses represented by (11) are characterized by the single kernel diagnosis c_1 = N. Therefore none of the conjunctions in (11) are kernel diagnoses.
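The claim about formula (11) can be checked by brute force for this two-component case. The sketch below enumerates every partial mode assignment, keeps those that entail (11), and then filters out any that entail another partial diagnosis. All names are hypothetical, and the exhaustive enumeration is only meant for tiny examples like this one.

```python
from itertools import product

DOMAINS = {"c1": ("N", "F"), "c2": ("N", "F")}
FORMULA_11 = [{"c1": "N", "c2": "N"}, {"c1": "N", "c2": "F"}]


def completions(partial):
    """All system behavioral modes that extend a partial assignment."""
    free = [c for c in DOMAINS if c not in partial]
    for values in product(*(DOMAINS[c] for c in free)):
        yield {**partial, **dict(zip(free, values))}


def satisfies(full, formula):
    return any(all(full[c] == m for c, m in conj.items()) for conj in formula)


def is_partial_diagnosis(partial, formula):
    """d |= formula: every completion of d satisfies the formula."""
    return all(satisfies(full, formula) for full in completions(partial))


def all_partial_assignments():
    comps = list(DOMAINS)
    for values in product(*((None,) + DOMAINS[c] for c in comps)):
        yield {c: v for c, v in zip(comps, values) if v is not None}


def entails(d, d_other):
    return all(d.get(c) == m for c, m in d_other.items())


def kernel_diagnoses(formula):
    partials = [p for p in all_partial_assignments()
                if is_partial_diagnosis(p, formula)]
    return [d for d in partials
            if not any(d != other and entails(d, other) for other in partials)]


KERNELS = kernel_diagnoses(FORMULA_11)
```

`KERNELS` contains only the assignment c_1 = N, and neither conjunction of (11) appears in it.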

Even though the paper [deKleer et al., 1992] defines partial and kernel diagnoses for the case of only two modes per component, the definition of partial and kernel diagnoses given above is also applicable to the case of more than two modes per component. However, the conjunctions in the output Q from Algorithm 2 will for this case not be kernel diagnoses. Instead, each conjunction represents a set of partial diagnoses, e.g. the first conjunction of (12) in Section 7 represents the two partial diagnoses c_1 = E ∧ c_3 = B and c_1 = E ∧ c_3 = G. Since the second conjunction of (12) represents e.g. c_1 = E ∧ c_2 = E ∧ c_3 = B, it is also obvious that the partial diagnoses represented by each conjunction are not necessarily kernel diagnoses.

7 Extracting Preferred Diagnoses

In Section 5 it was concluded that the conjunctions in the output Q from Algorithm 2 characterize all diagnoses, and that in the special case of two modes per component and only positive conflicts, there is a one-to-one correspondence between MNF conjunctions and the minimal diagnoses. This special case also has the property that if we study each conjunction in an MNF formula Q separately, it will have only one preferred diagnosis. This preferred diagnosis is also a preferred diagnosis when considering the whole formula Q. The consequence is that it is straightforward to extract the preferred diagnoses from a formula Q. In the general case, there is no such guarantee. For example, in the two-mode case and when some conflicts are non-positive, which means that the negated conflict will contain some assignment c = NF, there may be a conjunction not corresponding to a preferred diagnosis.

For an example with more than two modes, consider two components c_1 and c_2 where R_{c_i} = {NF, E, F} and NF >_{c_i} E >_{c_i} F, and a third component c_3 where R_{c_3} = {NF, B, G} with the only relations NF >_{c_3} B and NF >_{c_3} G. Then consider the MNF formula

Q = c_1 ∈ {E} ∧ c_3 ∈ {B, G} ∨ c_1 ∈ {E, F} ∧ c_2 ∈ {E, F} ∧ c_3 ∈ {B, G}    (12)

The preferred diagnoses consistent with the first conjunction are c_1 = E ∧ c_2 = NF ∧ c_3 = B and c_1 = E ∧ c_2 = NF ∧ c_3 = G. The preferred diagnoses consistent with the second are c_1 = E ∧ c_2 = E ∧ c_3 = B and c_1 = E ∧ c_2 = E ∧ c_3 = G. As seen, the two diagnoses c_1 = E ∧ c_2 = E ∧ c_3 = B and c_1 = E ∧ c_2 = E ∧ c_3 = G are not preferred diagnoses of the whole formula Q.

The example shows that preferred diagnoses cannot be extracted simply by considering one conjunction at a time. Instead, the following procedure can be used. For each conjunction in Q, find the preferred diagnoses consistent with that conjunction, and collect all diagnoses found in a set Ψ. The set Ψ may contain non-preferred diagnoses. These can be removed by a simple pairwise comparison. Note that the set Ψ need not be calculated for every new negated conflict that is processed. Instead, Ψ needs to be calculated only at the time the preferred diagnoses are really needed, for example before a service task is to be carried out.
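The procedure just described can be sketched as follows for the formula (12). The representation of the partial orders as sets of (better, worse) pairs, assumed to be transitively closed, and all function names are assumptions made for this illustration. The preferred diagnoses of each conjunction are found here by plain enumeration, which is simple but not efficient.

```python
from itertools import product

DOMAINS = {"c1": {"NF", "E", "F"}, "c2": {"NF", "E", "F"}, "c3": {"NF", "B", "G"}}
# Strict preferences per component as (better, worse) pairs, transitively closed.
STRICT = {
    "c1": {("NF", "E"), ("E", "F"), ("NF", "F")},
    "c2": {("NF", "E"), ("E", "F"), ("NF", "F")},
    "c3": {("NF", "B"), ("NF", "G")},
}


def geq(comp, b1, b2):
    """b1 is equally or more preferred than b2 for component comp."""
    return b1 == b2 or (b1, b2) in STRICT[comp]


def dominates(d1, d2):
    """d1 > d2 in the induced partial order on system behavioral modes."""
    return d1 != d2 and all(geq(c, d1[c], d2[c]) for c in d1)


def maximal(diagnoses):
    """Pairwise comparison: keep the diagnoses not dominated by another one."""
    return [d for d in diagnoses if not any(dominates(e, d) for e in diagnoses)]


def preferred_of_conjunction(conj):
    comps = sorted(DOMAINS)
    choices = [sorted(conj.get(c, DOMAINS[c])) for c in comps]
    return maximal([dict(zip(comps, v)) for v in product(*choices)])


def preferred_diagnoses(formula):
    psi = []  # the set Ψ, kept as a list without duplicates
    for conj in formula:
        for d in preferred_of_conjunction(conj):
            if d not in psi:
                psi.append(d)
    return maximal(psi)


Q12 = [{"c1": {"E"}, "c3": {"B", "G"}},
       {"c1": {"E", "F"}, "c2": {"E", "F"}, "c3": {"B", "G"}}]
PREFERRED = preferred_diagnoses(Q12)
```

For (12), Ψ first holds four diagnoses, and the pairwise comparison removes the two with c_2 = E, leaving the two preferred diagnoses with c_2 = NF.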

One may ask how much extra time is needed for the computation of the preferred diagnoses, compared to the time needed to process all negated conflicts and compute Q. To give an indication of this, the following empirical experiment was set up. A total of 132 test cases were randomly generated. The test cases represent systems with between 4 and 7 components, where each component has 4 possible behavioral modes. The number of negated conflicts varies between 2 and 12.

Figure 1: The total execution times for computing Q (dashed line) and preferred diagnoses (solid line). Axes: time [s] versus reference time [s].

In Figure 1, the results for the 132 test cases are shown. The reference time on the x-axis is chosen to be the computation time needed to compute Q. As seen, the figure indicates that the extra time needed to compute preferred diagnoses from the MNF formula Q is almost negligible compared to the time needed to compute only the MNF formula.

8 Conclusions

In this paper the minimal hitting-set algorithm from [deKleer and Williams, 1987] has been generalized to handle more than two modes per component and also non-positive conflicts. This has been done by first establishing a framework where all conflicts and diagnoses are represented with special logical formulas. Then the original minimal hitting-set algorithm needed only small modifications to obtain the desired results. It has been formally proven that Q ≃ ℙ, i.e. the algorithm output is equivalent to the set of all diagnoses. Further, it was proven that the algorithm output Q is in the MNF form, which guarantees that Q does not contain redundant conjunctions.

In a comparison with the original framework where conflicts and diagnoses are represented by sets, it was concluded that the conjunctions in the output Q from the generalized algorithm are a true generalization of the minimal diagnoses obtained from the minimal hitting-set algorithm. It has also been concluded that the conjunctions are a true generalization of kernel diagnoses. Since, for the case of more than two modes per component, minimal diagnoses do not necessarily correspond to the most desired diagnoses, it was instead shown how preferred diagnoses could be obtained from the conjunctions with a reasonable amount of effort.

References

[de Kleer and Williams, 1987] J. de Kleer and B.C. Williams. Diagnosing multiple faults. Artificial Intelligence, 32(1):97–130, 1987.

[de Kleer and Williams, 1989] J. de Kleer and B.C. Williams. Diagnosis with behavioral modes. IJCAI, pages 1324–1330, 1989.

[de Kleer et al., 1992] J. de Kleer, A.K. Mackworth, and R. Reiter. Characterizing diagnoses and systems. Artificial Intelligence, 56(2-3):197–222, 1992.

[Dressler and Struss, 1992] O. Dressler and P. Struss. Back to defaults: Characterizing and computing diagnoses as coherent assumption sets. ECAI, pages 719–723, 1992.

[Greiner et al., 1989] R. Greiner, B.A. Smith, and R.W. Wilkerson. A correction to the algorithm in Reiter's theory of diagnosis. Artificial Intelligence, 41(1):79–88, 1989.

[Reiter, 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57–95, April 1987.

[Struss and Dressler, 1989] P. Struss and O. Dressler. 'Physical negation' - integrating fault models into the general diagnosis engine. IJCAI, pages 1318–1323, 1989.

[Wotawa, 2001] F. Wotawa. A variant of Reiter's hitting-set algorithm. Information Processing Letters, 79(1):45–51, 2001.


Appendix

Lemma 1 The output Q from Algorithm 2 contains no two conjunctions such that Q_2 ⊨ Q_1.

PROOF. Assume the contrary, that Q_1 and Q_2 are two conjunctions in Q and Q_2 ⊨ Q_1. There are three cases that need to be investigated: (1) Q_1 ∈ D_old, Q_2 ∈ D_add, (2) Q_2 ∈ D_old, Q_1 ∈ D_add, (3) Q_1 ∈ D_add, Q_2 ∈ D_add.

1) The fact Q_2 ∈ D_add means that D_new = Q_2 at some point. Since Q_1 ∈ D_old, D_new must then have been compared to Q_1. Since Q_2 has really been added, it cannot have been the case that Q_2 ⊨ Q_1.

2) Since Q_1 ∈ D_add, it holds that Q_1 = D_i ∧ P_j for some D_i ∈ D. The fact Q_2 ⊨ Q_1 implies that Q_2 ⊨ D_i ∧ P_j ⊨ D_i. This is a contradiction since Q_2 ∈ D, and D is in MNF.

3) There are three cases: (a) Q_2 = D_i ∧ P_{j_2} ⊨ D_i ∧ P_{j_1} = Q_1, (b) Q_2 = D_{i_2} ∧ P_j ⊨ D_{i_1} ∧ P_j = Q_1, (c) Q_2 = D_{i_2} ∧ P_{j_2} ⊨ D_{i_1} ∧ P_{j_1} = Q_1, where in all cases, P_{j_1} ≠ P_{j_2} and D_{i_1} ≠ D_{i_2}.

a) We know that D_i and P are formulas of forms like D_i = a ∈ A ∧ b ∈ B ∧ c ∈ C and P = a ∈ A_p ∨ b ∈ B_p respectively. This means that Q_1 = a ∈ A ∩ A_p ∧ b ∈ B ∧ c ∈ C and Q_2 = a ∈ A ∧ b ∈ B ∩ B_p ∧ c ∈ C. The fact Q_2 ⊨ Q_1 implies that A ⊆ A ∩ A_p, which further means that A ⊆ A_p. This implies D_i = a ∈ A ∧ b ∈ B ∧ c ∈ C ⊨ a ∈ A_p ⊨ P. Thus, Q_1 and Q_2 are never subject to be added to D_add.

b) We have that Q_2 = D_{i_2} ∧ P_j ⊨ D_{i_1} ∧ P_j ⊨ D_{i_1} ∈ D. This means that Q_2 = D_{i_2} ∧ P_j cannot have been added to D_add.

c) We have that Q_2 = D_{i_2} ∧ P_{j_2} ⊨ D_{i_1} ∧ P_{j_1} ⊨ D_{i_1} ∈ D. This means that Q_2 = D_{i_2} ∧ P_{j_2} cannot have been added to D_add.

All these investigations show that it is impossible that Q_2 ⊨ Q_1.

Theorem 2 The output Q from Algorithm 2 is in MNF.

PROOF. From Lemma 1 it follows that Q contains no two conjunctions such that Q_2 ⊨ Q_1. All conjunctions in D_old are trivially on the form specified by (1). All conjunctions in D_add are also on the form (1) because of the requirement on D_new. Thus Q is in MNF.
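The entailment test between conjunctions that Lemma 1 and Theorem 2 rely on can be illustrated with a minimal sketch. It assumes a representation that is not taken from the paper: a conjunction is a mapping from component names to non-empty, proper subsets of that component's behavioral modes, and a component that does not appear is unconstrained. Under that assumption, Q_2 ⊨ Q_1 holds exactly when every component constrained in Q_1 is constrained in Q_2 to a subset of the same modes. The names `entails` and `has_redundant_pair` are hypothetical.

```python
from itertools import permutations


def entails(q2, q1):
    """Return True if conjunction q2 entails conjunction q1.

    Both arguments map a component name to a set of allowed modes
    (assumed to be a non-empty, proper subset of the component's modes).
    """
    return all(c in q2 and q2[c] <= q1[c] for c in q1)


def has_redundant_pair(conjunctions):
    """Return True if some conjunction entails a different one in the list."""
    return any(entails(q2, q1) for q1, q2 in permutations(conjunctions, 2))


# Illustration in the spirit of case 3a: two refinements of the same D_i.
q1 = {"a": {"NF"}, "b": {"NF", "F1"}}   # a in A ∩ A_p, b in B
q2 = {"a": {"NF", "F1"}, "b": {"NF"}}   # a in A, b in B ∩ B_p
q3 = {"a": {"NF"}, "b": {"NF"}}         # stronger than both q1 and q2

print(has_redundant_pair([q1, q2]))      # neither entails the other
print(has_redundant_pair([q1, q2, q3]))  # q3 entails q1 and q2
```

A list for which `has_redundant_pair` returns False satisfies the non-redundancy part of the MNF requirement; the additional requirement that every conjunction has the form (1) is not checked by this sketch.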

Lemma 2 Let Q be the output from Algorithm 2 after processing all negated conflicts in P. For any two conjunctions Q_1 and Q_2 in Q, there is no component c and conjunction D such that Q_1 ≃ D ∧ c ∈ A_1 and Q_2 ≃ D ∧ c ∈ A_2 where A_1 ⊆ R_c and A_2 ⊆ R_c.

PROOF. Assume that there is a component c and conjunction D such that Q_1 ≃ D ∧ c ∈ A_1 and Q_2 ≃ D ∧ c ∈ A_2. We can write Q_1 as c ∈ A_{φ_1} ∧ D_1 where A_{φ_1} is the intersection of the sets M_j obtained from all P ∈ φ_1 ⊆ P, and D_1 is the conjunction of one P_j obtained from every P ∈ P \ φ_1. Similarly we write Q_2 as c ∈ A_{φ_2} ∧ D_2.

We can find a D′ such that D′ ≃ D_1 ≃ D_2 and where D′ is the conjunction of one P_j obtained from every P ∈ P \ (φ_1 ∩ φ_2). Then let D* = c ∈ A_{φ_1∩φ_2} ∧ D′, which means that Q_1 ⊨ c ∈ A_{φ_1∩φ_2} ∧ D_1 ≃ D*. Similarly we can obtain the relation Q_2 ⊨ c ∈ A_{φ_1∩φ_2} ∧ D_2 ≃ D*. By construction of D* it can be realized that D* ⊨ Q_k for some conjunction Q_k in Q. Because of this relation, Q_1 and Q_2 cannot both be contained in Q, which is a contradiction. This means that there cannot be a component c and conjunction D such that Q_1 ≃ D ∧ c ∈ A_1 and Q_2 ≃ D ∧ c ∈ A_2.

Lemma 3 Let Q = D_old ∧ D_add be the output from Algorithm 2 after processing all negated conflicts in P. If D_{i_m} is not contained in D_old, and the set D_{i_m} ∧ P_j is not contained in D_add, after running the algorithm, then there is a D_{i_{m+1}} such that D_{i_m} ∧ P_j ⊨ D_{i_{m+1}} and D_{i_{m+1}} ∧ P_j ⊭ D_{i_m} ∧ P_j.

PROOF. The fact that D_{i_m} is not contained in D_old means that the inner loop of the algorithm must have been entered when D_i = D_{i_m}. Then the fact that D_{i_m} ∧ P_j is not contained in D_add means that D_{i_m} ∧ P_j ⊨ D_k for some D_k, k ≠ i_m. By choosing i_{m+1} = k, this gives D_{i_m} ∧ P_j ⊨ D_{i_{m+1}}.

Next we prove that D_k ∧ P_j ⊭ D_i ∧ P_j. Let the single assignment in P_j be a ∈ A_p. We will divide the proof into four cases: (1) a ∉ comps D_i, a ∉ comps D_k, (2) a ∈ comps D_i, a ∉ comps D_k, (3) a ∉ comps D_i, a ∈ comps D_k, and (4) a ∈ comps D_i, a ∈ comps D_k.

1) The fact D_i ∧ P_j ⊨ D_k would imply D_i ⊨ D_k, which is impossible because D is in MNF.

2) This means that D_i can be written as D_i = D′ ∧ a ∈ A_i. The fact D_i ∧ P_j ⊨ D_k would then imply that D′ ⊨ D_k and consequently that D_i ⊨ D_k, which is impossible because D is in MNF.

3) First assume that D_i contains a component c ∉ D_k. Note that this component is not component a. This would imply that c is not contained in P_j. Thus the components of D_i ∧ P_j are not a subset of the components of D_k ∧ P_j, which implies D_k ∧ P_j ⊭ D_i ∧ P_j. The case left to investigate is when the components of D_i are a subset of the components of D_k. Assume that D_k ∧ P_j ⊨ D_i ∧ P_j. This relation can be written D′_k ∧ a ∈ A_p ∩ A_k ⊨ D_i ∧ a ∈ A_p where D′_k is a conjunction not containing component a. For this relation to hold it must hold that D′_k ⊨ D_i. This means that D_k = a ∈ A_k ∧ D′_k ⊨ D_i, which is impossible because D is in MNF.

4) Assume that D_k ∧ P_j ⊨ D_i ∧ P_j. This relation can be written D′_k ∧ a ∈ A_p ∩ A_k ⊨ D′_i ∧ a ∈ A_p ∩ A_i where D′_k and D′_i are conjunctions not containing component a. This relation would imply D′_k ⊨ D′_i. Further on, the fact D_i ∧ P_j ⊨ D_k can be written a ∈ A_p ∩ A_i ∧ D′_i ⊨ a ∈ A_k ∧ D′_k, which implies that D′_i ⊨ D′_k. Thus we have D′_i ≃ D′_k and the only possible difference between D_i and D_k is the assignment of component a. Lemma 2 says this is impossible.

With i = i_m and k = i_{m+1}, these four cases have shown that D_{i_{m+1}} ∧ P_j ⊭ D_{i_m} ∧ P_j.


Lemma 4 Let D be the output from Algorithm 2 after processing all negated conflicts in P_{n−1}, and Q the output given D and P as inputs. For each conjunction D_i in D and P_j in P it holds that there is a conjunction Q_k in Q such that D_i ∧ P_j ⊨ Q_k.

PROOF. If, after running the algorithm, D_i is contained in D_old, then the lemma is trivially fulfilled. If instead D_i ∧ P_j is contained in D_add, then the lemma is also trivially fulfilled. Study now the case where D_i is not contained in D_old and D_i ∧ P_j is not contained in D_add. We can then apply Lemma 3 with P = P_{n−1} ∪ P. This gives us a D_{i_{m+1}} such that D_{i_m} ∧ P_j ⊨ D_{i_{m+1}} and D_{i_{m+1}} ∧ P_j ⊭ D_{i_m} ∧ P_j.

If D_{i_{m+1}} is contained in D_old, then the lemma is fulfilled. If instead D_{i_{m+1}} ∧ P_j is contained in D_add, note that D_{i_m} ∧ P_j ⊨ D_{i_{m+1}} implies D_{i_m} ∧ P_j ⊨ D_{i_{m+1}} ∧ P_j. This means that the lemma is fulfilled. In this way we can repeatedly apply Lemma 3 as long as the new D_{i_{m+1}} obtained is not contained in D_old and D_{i_{m+1}} ∧ P_j is not contained in D_add.

We will now prove that after a finite number of applications of Lemma 3 we obtain a D_{i_{m+1}} where D_{i_{m+1}} is contained in D_old or D_{i_{m+1}} ∧ P_j is contained in D_add. Note that each application of Lemma 3 guarantees that D_{i_m} ∧ P_j ⊨ D_{i_{m+1}} ∧ P_j and D_{i_{m+1}} ∧ P_j ≄ D_{i_m} ∧ P_j. This fact itself implies that there cannot be an infinite number of applications of Lemma 3.

Theorem 1 Let P be a set of negated conflicts that is not inconsistent, i.e. P ⊭ ⊥, and let Q be the output from Algorithm 2 after processing all negated conflicts in P. Then it holds that Q ≃ P.

PROOF. Let P_{n−1} denote the set of all negated conflicts in P except P. Then it holds that P ≃ P_{n−1} ∪ P ≃ D ∧ P. Lemma 4 implies that D ∧ P ⊨ Q. Left to prove is Q ⊨ D ∧ P. Take an arbitrary conjunction Q_k in the output Q. If Q_k is in D_old, then it must also be in D, i.e. Q_k = D_i for some conjunction D_i in D. The fact that D_i is in D_old also means that D_i ⊨ P. Thus Q_k = D_i ⊨ D ∧ P.

Lemma 5 Let P_{n−1} ∪ P_n be a set of negated conflicts, and let each component have only two possible behavioral modes. If D is the output from Algorithm 2 after processing all negated conflicts in P_{n−1}, then a new call to the algorithm with inputs D and P_n gives an output Q in which each conjunction is a kernel diagnosis.

PROOF. Take an arbitrary conjunction Q_k in Q. It holds that Q_k ≃ D_i ∧ P_j for some conjunction D_i in D and some conjunction P_j in P_n. If Q_k ≃ D_i, then Q_k is a kernel diagnosis since D_i is. Next we investigate the other case Q_k ≄ D_i.

Assume that Q_k is not a kernel diagnosis. The assignment P_j can be written as c_p = M_p. Thus, we can write Q_k as Q_k = D_i ∧ (c_p = M_p). Since by assumption Q_k is not a kernel diagnosis, we can remove one assignment, either c_p = M_p or some assignment a = M_a in D_i, from Q_k and obtain a partial diagnosis. The partial diagnosis obtained is either D_i or D ∧ c_p = M_p, where D_i = D ∧ a = M_a. Study first the case where D_i is the partial diagnosis. By definition, this means that D_i ⊨ P_{n−1} ∪ P_n, which implies D_i ⊨ P_n. This means that D_i would not be removed from D_old and thus become one conjunction in Q. Since Q_k = D_i ∧ (c_p = M_p) ⊨ D_i, Q_k and D_i cannot both be conjunctions in Q because Q is in MNF according to Theorem 2. This contradiction shows that D_i cannot be a partial diagnosis.

Next, study the case where D ∧ c_p = M_p is the partial diagnosis, and let M̄_a denote the complementary element to M_a. This means that both D ∧ c_p = M_p ∧ a = M_a and D ∧ c_p = M_p ∧ a = M̄_a are partial diagnoses. This means, by definition, that D ∧ c_p = M_p ∧ a = M̄_a ⊨ P_{n−1} ∪ P_n ≃ Q. Since Q_k = D ∧ a = M_a ∧ c_p = M_p, and Q is in MNF, there must be another Q_m such that D ∧ c_p = M_p ∧ a = M̄_a ⊨ Q_m. According to Lemma 2, it cannot hold that Q_m = D ∧ c_p = M_p ∧ a = M̄_a. Therefore we can remove one assignment from D ∧ c_p = M_p ∧ a = M̄_a and still obtain a conjunction d such that d ⊨ Q_m. Note then that it cannot hold that d = D ∧ c_p = M_p since this would imply that Q_k ⊨ Q_m.

Now we investigate the case d = D ∧ a = M̄_a. Let Ω denote the set of assignments contained in D. The fact that Q_k = D ∧ a = M_a ∧ c_p = M_p means that each negated conflict P ∈ P_{n−1} ∪ P_n contains an assignment in Ω ∪ {a = M_a} ∪ {c_p = M_p}.

Next, D ∧ a = M̄_a ⊨ Q_m means that Q_m contains a subset of the assignments contained in D ∧ a = M̄_a. This further means that each negated conflict P ∈ P_{n−1} ∪ P_n contains an assignment from Ω_m ∪ {a = M̄_a}. This means that a P′ that does not contain any assignment from Ω_m must contain the assignment a = M̄_a. The consequence of this is that P′ cannot contain the assignment a = M_a. Since it was concluded above that each P contains an assignment in Ω ∪ {a = M_a} ∪ {c_p = M_p}, P′ must then contain the assignment c_p = M_p. Thus each negated conflict P ∈ P_{n−1} ∪ P_n contains an assignment from Ω_m ∪ {c_p = M_p}.

We can now select one assignment from each P ∈ P_{n−1} ∪ P_n but with the requirement that the selected assignment must be c_p = M_p or contained in Ω. By forming a conjunction Φ of these assignments, it will hold that D ∧ c_p = M_p ⊨ Φ. Therefore Q_k = D ∧ a = M_a ∧ c_p = M_p ⊨ Φ. If Φ is not one of the conjunctions in Q, there will be another Q_v such that Φ ⊨ Q_v. This means that Q_k ⊨ Q_v and Q_k cannot be contained in Q, which is a contradiction. Thus we have shown that it cannot hold that d = D ∧ a = M̄_a, and therefore that D ∧ c_p = M_p cannot be a partial diagnosis. This further means that Q_k must be a kernel diagnosis.

Theorem 3 Let each component have only two possible behavioral modes, let P be a set of negated conflicts, and let Q be the output from Algorithm 2 after processing all negated conflicts in P. Then it holds that each conjunction of Q is a kernel diagnosis.

PROOF. It is not difficult to realize that, after processing the first two negated conflicts in P, each conjunction of the output Q is a kernel diagnosis. For each further negated conflict that is processed, each conjunction of the new output will be a kernel diagnosis according to Lemma 5.


Runtime Fault Detection and Localization in Component-oriented Software Systems∗

Bernhard Peischl and Joerg Weber and Franz Wotawa

Technische Universität Graz

Institute for Software Technology

8010 Graz, Inffeldgasse 16b/2, Austria

Tel: +43 316 873 5723, Fax: +43 316 873 5706

peischl,jweber,[email protected]

Abstract

In this paper we introduce a novel technique for run-time fault detection and localization in component-oriented software systems. Our approach allows arbitrary properties to be defined via rules at the component level. By monitoring the software system at run-time we can detect violations of these properties and, most notably, also localize possible causes for specific property violation(s). Relying on the model-based diagnosis paradigm, our fault localization technique is able to deal with intermittent fault symptoms and it allows for measurement selection. Finally, we discuss results obtained from our most recent case studies and relate our work to that of others.

1 Introduction

Several research areas are engaged in the improvement of software reliability during the development phase, for example research on testing, debugging, or formal verification techniques like model checking. Unfortunately, although substantial progress has been made in these fields, we have to accept the fact that faults in complex software systems are facts to be coped with, not problems to be solved [Patterson et al., 2002]. This perspective is supported by historical evidence and by numerous studies. Thus, it is highly desirable to augment complex software systems with autonomic fault localization capabilities, especially in systems which require high reliability.

The goal of our work is to detect and locate faults at runtime without any human intervention. Existing techniques like runtime verification aim at the detection of faults. However, it is necessary to locate faults in order to be able to automatically perform repair at runtime. Possible repair actions are, for example, the restart of software components or switching to redundant components.

In this paper we propose a technique for runtime fault detection and localization in component-oriented software systems. We define components as independent computational modules which have no shared memory and which communicate with each other by means of events, which can contain arbitrary attributes. We suppose that the interactions are asynchronous. A component-oriented software system may be a single application which comprises loosely coupled processes (threads), or it may consist of multiple independent applications which communicate among themselves. The components may be distributed over a network. Typical implementations of the asynchronous event-based communication paradigm are, for example, CORBA, COM, JavaBeans, or even low-level communication methods like Unix message passing.

∗This research has been funded in part by the Austrian Science Fund (FWF) under grant P17963-N04. Authors are listed in alphabetical order.

Moreover, we suppose that a certain event which is produced by a component cannot be directly related to a specific incoming event, for example because incoming events may be internally queued. Furthermore, there may be connections which are not observable. Another assumption is that, as is often the case in practice, no formalized knowledge about the application domain exists.

We require the runtime diagnosis to impose low run-time overhead in terms of computational power and memory consumption. Ideally, augmenting the software with fault detection and localization functionality necessitates no change to the software. We require the monitoring process to have no noticeable influence on the overall behavior of the system. Moreover, to avoid damage which could be caused by a faulty system, we have to achieve quick detection, localization, and repair. Another difficulty is the fact that the fault symptoms are often intermittent. One reason is that in runtime diagnosis the inputs to the system cannot be kept constant while the diagnosis is performed, as the system continues operating. For example, a server application may receive new client requests during the fault localization process. In addition, software systems often operate in a physical environment which permanently changes, e.g. the control software of a mobile robot.

Our approach allows user-defined properties to be introduced. The target system is continuously monitored by rules, i.e., pieces of software which detect property violations. The fact that the modeler can implement and integrate arbitrary rules provides sufficient flexibility to cope with today's software complexity. In practice, properties and rules will often embody elementary insights into the software behavior rather than complete specifications. The reason is that, due to the complexity of software systems, often no formalized knowledge of the behavior exists and the informal specifications are coarse and incomplete.

In order to enable fault localization, complex dependences between properties can be defined. When a violation occurs, we locate the fault by employing the model-based diagnosis (MBD) paradigm [Reiter, 1987; de Kleer and Williams, 1987]. In terms of the classification introduced in [Brusoni et al., 1998], we propose a state-based diagnosis approach with temporal behavior abstraction. Furthermore, our model is able to deal with intermittent symptoms.

We evaluated our approach using the control software of a mobile autonomous robot as target system. The concrete models which we created for this system mainly aimed at the diagnosis of severe faults like software crashes and deadlocks.

Among the novel contributions of this paper is the monitoring of user-defined properties at the component level by integrating arbitrary rules. In particular, we employ relationships between properties for the localization of faults. Furthermore, we provide a formalization of the architecture of a component-based software system and of the property dependences, and we outline an algorithm for computing the logical model. We formally describe the diagnosis system and we present a runtime fault detection and localization algorithm which allows for measurement selection. Moreover, we give examples related to the control software of autonomous robots and discuss the results of case studies. Finally, we relate our work to that of others.

2 Introduction to the Model Framework

[Diagram omitted. Components: Vision, Odometry, Kicker, WorldModel (WM), Planner. Connections: OM (Object Measurement), MD (Motion Delta), WS (World State), PS (PlannerState), HB (HasBall).]

Figure 1: Architectural view on the software system of our example.

Figure 1 illustrates a fragment of a control system for an autonomous soccer robot as our running example. This architectural view comprises basically independent components which communicate by asynchronous events. The connections between the components depict data flows.

The Vision component periodically produces events containing position measurements of perceived objects. The Odometry periodically sends odometry data to the WorldModel (WM). The WM uses probability-based methods for tracking object positions. For each event arriving at one of its inputs, it creates an updated world model containing estimated object positions. The Kicker component periodically creates events indicating whether or not the robot owns the ball. The Planner maintains a knowledge base (KB), which includes a qualitative representation of the world model, and chooses abstract actions based on this knowledge. The content of the KB is periodically sent to a monitoring application.

In [Steinbauer and Wotawa, 2005] an abstract behavior model of software components is proposed which is similar to the model in [Friedrich et al., 1999]. This model abstracts over concrete values in terms of functional dependences [Jackson, 1995]: If we assume a correctly working component and all inputs are correct, then the output(s) must be correct as well. In our example, the Planner component would be modelled as follows: ¬AB(Planner) ∧ ok(WS) ∧ ok(HB) → ok(PS), where AB(c) denotes abnormality of component c and ok(e) states that a connection e is correct during a certain period of time. That is, the model abstracts from both the temporal constraints and the possibly complex values (logical contents) of events. Whether a connection is correct or not is determined by observers. An observer comprises rules, which are pieces of software which monitor certain parts of the software system. For example, the observers for ok(OM), ok(MD), and ok(PS) would contain rules which continuously check if events on these connections are produced periodically.

While the model in [Steinbauer and Wotawa, 2005] proved applicable in various settings, we argue that in many cases this model is too abstract to express software behavior. As a matter of fact, a component's complex behavior cannot be captured by simple dependences.

First, a separation of temporal constraints and constraints related to the values of events is highly desirable. For example, in Figure 1, the Planner is supposed to produce events on the connection PS periodically, regardless of the inputs to this component. Thus this constraint does not depend on any input, and from its violation we can directly infer that the Planner has failed. However, the value of the PS events is directly influenced by the Planner's inputs.

Second, as events may contain complex values, fine-grained dependences are necessary for capturing the real behavior. For example, some parts of the knowledge base, whose content is transmitted over the PS connection, depend on the world model, i.e. on the WS connection, while other parts depend on the HB connection.

Our new model addresses these issues by assigning a set of properties, i.e. constraints, to components and connections. In the logical model, these properties are represented by property constants. We use the proposition ok(x, pr, s), which states that the property pr holds for x during a certain period of time, where x is either a component or a connection. While the system is continuously monitored by the rules, the diagnosis itself is based on (multiple) discrete snapshots of the system. The snapshots are obtained by polling the states of the rules (violated or not violated) at discrete time points. Each observation belongs to a certain snapshot, and we use the variable s as a placeholder for a specific snapshot. The diagnosis accounts for the observations of all snapshots. This approach to MBD is called multiple-snapshot diagnosis or state-based diagnosis [Brusoni et al., 1998].

An example for a component-related property is pr_np, expressing that the number of processes (threads) of a correctly working component c must exceed a certain threshold. In our running example, pr_pe denotes that events must occur periodically on a connection, and pr_cons_e is used to denote that the value of events on a certain connection must not contradict the events on connection e.

[Diagram omitted: the architecture of Figure 1 (Vision, Odometry, Kicker, WorldModel (WM), Planner) with the connections annotated by ok(OM, pr_pe); ok(MD, pr_pe); ok(HB, pr_pe); ok(WS, pr_pe), ok(WS, pr_cons_OM); ok(PS, pr_pe), ok(PS, pr_cons_OM), ok(PS, pr_cons_HB).]

Figure 2: The improved model allows properties to be defined for each connection.

The observer for ok(WS, pr_cons_OM, s) checks if the computed world models on connection WS correspond to the object position measurements on connection OM. Ideally, such an observer would embody a complete specification of the tracking algorithm used in the WM component. In practice, however, often only incomplete and coarse specifications of the complex software behavior are available. Therefore, the observers rely on simple insights which require little expert knowledge. The rules of the observer for ok(WS, pr_cons_OM, s) could check if all environment objects which are perceived by the Vision are also part of the computed world models, disregarding the actual positions of the objects (note that the set of perceived objects often changes in a dynamic environment, and those objects which are no longer perceived will be tracked by the WM for a while and finally discarded). Our experience has shown that such abstractions often suffice to detect and locate severe faults like software crashes or deadlocks.
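A rule of the kind just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ConsistencyRule`, the handler names `on_om_event` and `on_ws_event`, and the representation of events as sets of object identifiers are all hypothetical. The rule compares only the sets of objects and ignores positions, and once it has switched to the violated state it stays there.

```python
from dataclasses import dataclass, field


@dataclass
class ConsistencyRule:
    """Hypothetical rule backing ok(WS, pr_cons_OM, s)."""

    perceived: set = field(default_factory=set)
    violated: bool = False

    def on_om_event(self, object_ids):
        # Remember the objects most recently reported on connection OM.
        self.perceived = set(object_ids)

    def on_ws_event(self, tracked_ids):
        # Every perceived object must appear in the world model on WS.
        # Extra tracked objects are allowed: the WM keeps tracking objects
        # for a while after they are no longer perceived.
        if not self.perceived <= set(tracked_ids):
            self.violated = True


rule = ConsistencyRule()
rule.on_om_event({"ball", "goal"})
rule.on_ws_event({"ball", "goal", "opponent"})  # consistent
rule.on_ws_event({"goal"})                      # "ball" is missing
print(rule.violated)
```

Because the events are asynchronous, a real rule would probably have to tolerate a short delay between an OM event and the corresponding WS event; the sketch leaves that out.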

Using such properties, the dependences between the inputs and outputs of components can be refined, as the logical model in Figure 3 shows. Figure 2 depicts the properties which we assign to the connections, and Figure 4 shows the dependences between properties on the input and output connections of the WM and the Planner.

The model captures, for example, that the WM must generate events periodically, provided that the temporal constraints on the incoming connections hold. Furthermore, the value of the events on connection WS must be consistent with the OM connection, provided that the events on OM occur periodically.

To illustrate our basic approach we outline a simple scenario by locating the cause for observed malfunctioning. We assume a fault in the WM causing the world state WS and, as a consequence, the planner state PS to become inconsistent with the object position measurements OM. As a result, the observer for ok(PS, pr_cons_OM, s) detects a violation, i.e. ¬ok(PS, pr_cons_OM, s_0) is an observation for snapshot 0. All other observers are initially disabled, i.e. they do not provide any observations.

Based on this observation, we can compute diagnosis candidates by employing the MBD [Reiter, 1987; de Kleer and Williams, 1987] approach for this observation snapshot. By computing all (subset minimal) diagnoses, we obtain three single-fault diagnosis candidates, namely {AB(Vision)}, {AB(WM)}, and {AB(Planner)}. Note that, using the coarse-grained model in [Steinbauer and Wotawa, 2005], the Odometry and the Kicker would be candidates, too.

¬AB(Vision) → ok(OM, pr_pe, s)
¬AB(Odometry) → ok(MD, pr_pe, s)
¬AB(WM) ∧ ok(OM, pr_pe, s) ∧ ok(MD, pr_pe, s) → ok(WS, pr_pe, s)
¬AB(WM) ∧ ok(OM, pr_pe, s) → ok(WS, pr_cons_OM, s)
¬AB(Kicker) → ok(HB, pr_pe, s)
¬AB(Planner) → ok(PS, pr_pe, s)
¬AB(Planner) ∧ ok(WS, pr_cons_OM, s) → ok(PS, pr_cons_OM, s)
¬AB(Planner) → ok(PS, pr_cons_HB, s)
for each component c: ¬AB(c) → ok(c, pr_np, s)

Figure 3: An improved model with refined dependences for our example.

After activating observers for the output connections of these candidates, we obtain the second observation snapshot ok(OM, pr_pe, s_1), ok(WS, pr_pe, s_1), ¬ok(WS, pr_cons_OM, s_1), ok(PS, pr_pe, s_1), ¬ok(PS, pr_cons_OM, s_1), and ok(PS, pr_cons_HB, s_1). This leads to the single diagnosis {AB(WM)}.
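The two steps of this scenario can be reproduced with a small brute-force sketch of consistency-based diagnosis. It is an illustration only and not the diagnosis algorithm of the paper: the sentences of Figure 3 are encoded as Horn rules (the pr_np sentences are left out because the scenario contains no observations for them), a candidate set of abnormal components is accepted if forward chaining from the remaining components derives no atom that was observed as violated, and candidates are enumerated by increasing size. The names `RULES`, `consistent` and `minimal_diagnoses` are hypothetical.

```python
from itertools import combinations

COMPONENTS = ["Vision", "Odometry", "WM", "Kicker", "Planner"]

# (component, body, head) encodes: not AB(component) and body -> head.
# Atoms are (connection, property) pairs; the snapshot index is implicit.
RULES = [
    ("Vision", [], ("OM", "pe")),
    ("Odometry", [], ("MD", "pe")),
    ("WM", [("OM", "pe"), ("MD", "pe")], ("WS", "pe")),
    ("WM", [("OM", "pe")], ("WS", "cons_OM")),
    ("Kicker", [], ("HB", "pe")),
    ("Planner", [], ("PS", "pe")),
    ("Planner", [("WS", "cons_OM")], ("PS", "cons_OM")),
    ("Planner", [], ("PS", "cons_HB")),
]


def consistent(abnormal, ok_obs, not_ok_obs):
    """Forward-chain from the healthy components and the observed ok atoms."""
    derived = set(ok_obs)
    changed = True
    while changed:
        changed = False
        for comp, body, head in RULES:
            if (comp not in abnormal and head not in derived
                    and all(atom in derived for atom in body)):
                derived.add(head)
                changed = True
    return derived.isdisjoint(not_ok_obs)


def minimal_diagnoses(snapshots):
    """Subset-minimal sets of abnormal components consistent with all snapshots."""
    found = []
    for size in range(len(COMPONENTS) + 1):
        for cand in map(set, combinations(COMPONENTS, size)):
            if any(d <= cand for d in found):
                continue  # a smaller diagnosis is already contained in cand
            if all(consistent(cand, ok, bad) for ok, bad in snapshots):
                found.append(cand)
    return found


S0 = (set(), {("PS", "cons_OM")})
S1 = ({("OM", "pe"), ("WS", "pe"), ("PS", "pe"), ("PS", "cons_HB")},
      {("WS", "cons_OM"), ("PS", "cons_OM")})

print(minimal_diagnoses([S0]))      # candidates after snapshot 0
print(minimal_diagnoses([S0, S1]))  # candidates after both snapshots
```

With snapshot 0 alone the sketch yields the three single-fault candidates Vision, WM and Planner; adding snapshot 1 leaves only WM, in line with the text. The exhaustive enumeration is exponential in the number of components and is meant only to make the reasoning checkable.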

Let us consider a second scenario related to Figure 2. It demonstrates that our model framework allows refinements which may lead to the correct identification of multiple-fault diagnoses in situations in which a less fine-grained model would find solely single-fault diagnoses.

We assume that monitoring the connections OM and MD is either impossible or unrealistic due to high costs. Suppose that no events on connection WS occur, thus ¬ok(WS, pr_pe, s_0) is observed. Given the model in Fig. 3, we obtain 3 single-fault diagnoses: {AB(Vision)}, {AB(Odometry)}, {AB(WM)}. However, the WM generates an output event for each event on one of its incoming connections. Therefore, if only one of the components Vision or Odometry were faulty, the number of events on the connection WS would still be larger than 0. As a consequence, we can conclude that either the WM or both the Vision and the Odometry have failed.

We gain a better result by refining the model as shown in Figure 5. The sentences in Figure 5 extend the model in Figure 3. The new property pr_eo holds only if at least one event occurs on a connection during a certain time period. A new sentence is added to the model of the WM component. It states that, if the WM works correctly and the property pr_eo holds for at least one input connection, then it must hold for the connection WS as well. Note that we use a kind of dependence for pr_eo that is different from what we have seen so far. We will call this a partial dependence in Section 3.

Now the observers detect two property violations: ¬ok(WS, pr_pe, s_0) and ¬ok(WS, pr_eo, s_0). We obtain a single-fault diagnosis {AB(WM)} and a single dual-fault diagnosis {AB(Vision), AB(Odometry)}, which obviously resembles the human kind of reasoning.

[Diagram omitted. WM: ok(OM, pr_pe), ok(MD, pr_pe), ok(WS, pr_pe), ok(WS, pr_cons_OM), ok(WM, pr_np). Planner: ok(WS, pr_cons_OM), ok(PS, pr_pe), ok(PS, pr_cons_OM), ok(PS, pr_cons_HB), ok(Planner, pr_np). Legend: property depends on component only; property also depends on input.]

Figure 4: Graphical representation of dependences in Fig. 3 for two example components.

¬AB(Vision) → ok(OM, pr_eo, s)
¬AB(Odometry) → ok(MD, pr_eo, s)
¬AB(WM) ∧ (ok(OM, pr_eo, s) ∨ ok(MD, pr_eo, s)) → ok(WS, pr_eo, s)

Figure 5: Extension of the model in Fig. 3.
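The effect of the partial dependence can be checked with a short sketch that computes minimal hitting sets of conflicts, in the sense of [Reiter, 1987]. The conflict sets below were derived by hand from the sentences in Fig. 3 and Fig. 5 and are therefore an assumption of this example: ¬ok(WS, pr_pe, s_0) rules out that Vision, Odometry and WM are all healthy, and ¬ok(WS, pr_eo, s_0) rules out both that Vision and WM are healthy and that Odometry and WM are healthy. The function name `minimal_hitting_sets` is hypothetical, and the enumeration is brute force rather than a hitting-set tree.

```python
from itertools import combinations


def minimal_hitting_sets(conflicts):
    """Return all subset-minimal sets that intersect every conflict."""
    universe = sorted(set().union(*conflicts))
    found = []
    for size in range(1, len(universe) + 1):
        for cand in map(set, combinations(universe, size)):
            if any(h <= cand for h in found):
                continue  # cand contains a smaller hitting set
            if all(cand & conflict for conflict in conflicts):
                found.append(cand)
    return found


# Model of Fig. 3 only: one conflict from the violated pr_pe property.
coarse = [{"Vision", "Odometry", "WM"}]

# Model of Fig. 3 extended by Fig. 5: two further conflicts from pr_eo.
refined = coarse + [{"Vision", "WM"}, {"Odometry", "WM"}]

print(minimal_hitting_sets(coarse))   # three single-fault diagnoses
print(minimal_hitting_sets(refined))  # {WM} and {Odometry, Vision}
```

The refined conflict set no longer admits Vision or Odometry alone as a diagnosis, which matches the informal argument that a single faulty input component would still leave some events on WS.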

3 Formalizing the Model Framework

In Definition 3.1 we introduce a model which captures the architecture of a component-oriented software system and the dependences between properties.

Definition 3.1 (SAM) A software architecture model (SAM) is a tuple (COMP, CONN, Φ, ϕ, out, inp, int) with:

• a set of components COMP

• a set of connections CONN

• a (finite) set of properties Φ

• a function ϕ : COMP ∪ CONN → 2^Φ, assigning properties to a given component or connection.

• a function out : COMP → 2^CONN, returning the output connections for a given component.

• the (partial) functions inp and int : COMP × CONN × Φ → 2^(CONN×Φ), which express the functional dependences between the inputs and outputs of a given component c. For all output connections e ∈ out(c) and for each property pr ∈ ϕ(e), they return a set of tuples (e′, pr′), where e′ is an input connection of c and pr′ ∈ ϕ(e′) a property assigned to e′.

This definition allows a set of properties Φ to be specified for a specific software system. We introduce a function ϕ in order to assign properties to components and connections.

The functions inp and int formalize the functional dependences between properties of the inputs and of the outputs. For each property pr of an output connection, they return a set of input properties PR′ on which pr depends. Function int expresses total dependences: if a component is correct and all properties in PR′ hold, then pr must hold as well. By contrast, inp defines partial dependences: if a component is correct and at least one property in PR′ holds, then pr must hold, too. In our example, the dependence of (WS, pr_eo) on (OM, pr_eo) and (MD, pr_eo) is partial (see Fig. 5). All other dependences in this example are total.

Note that [Friedrich et al., 1999] and [Steinbauer and Wotawa, 2005] use only one kind of functional dependence, which is equivalent to what we call a total dependence herein.

For example, those parts of the SAM which relate to the WM component and its output connection WS are defined as follows (Fig. 3 and 5):

ϕ(WM) = {pr_np}, ϕ(WS) = {pr_pe, pr_cons_OM, pr_eo}
out(WM) = {WS}
int(WM, WS, pr_pe) = {(OM, pr_pe), (MD, pr_pe)}
int(WM, WS, pr_cons_OM) = {(OM, pr_pe)}
inp(WM, WS, pr_eo) = {(OM, pr_eo), (MD, pr_eo)}

The logical model is computed by Algorithm 1. Based on a SAM, it generates the logical system description SD. In line (3), we create those sentences which relate to component properties. In line (4), a logical representation of the dependences between properties is computed. A distinction is made between total and partial dependences.

Note that universal quantification implicitly applies to the variable s, which denotes a discrete snapshot of the system behavior. Each observation (¬)ok(x, pr, s_i) relates to a certain snapshot s_i, where i is the snapshot index. A diagnosis is a solution for all snapshots. The temporal ordering of the different snapshots is not taken into account.

It is also important that, provided the number of snapshots is finite, the logical model computed by this algorithm can easily be transformed into propositional Horn clauses, and thus the model is amenable to efficient logical reasoning.
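The construction of the system description and its Horn form can be sketched as follows. The sketch rests on several assumptions that are not stated in the paper: ¬AB(c) is treated as a propositional atom `("nab", c)`, the partial functions int and inp are dictionaries in which a missing key means "undefined" (no sentence is produced) while an empty list means a dependence on the component only, each disjunct of a partial dependence becomes its own Horn clause, and the snapshot variable s is left implicit. The names `SAM` and `compute_model` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SAM:
    comp: list                                  # components COMP
    phi: dict                                   # component/connection -> properties
    out: dict                                   # component -> output connections
    int_: dict = field(default_factory=dict)    # (c, e, pr) -> [(e', pr')], total
    inp: dict = field(default_factory=dict)     # (c, e, pr) -> [(e', pr')], partial


def compute_model(sam):
    """Return the system description as Horn clauses (body, head)."""
    sd = []
    for c in sam.comp:
        nab = ("nab", c)  # stands for "not AB(c)"
        for pr in sam.phi.get(c, []):
            sd.append(([nab], ("ok", c, pr)))
        for e in sam.out.get(c, []):
            for pr in sam.phi.get(e, []):
                head = ("ok", e, pr)
                key = (c, e, pr)
                if key in sam.int_:
                    # Total dependence: all input properties in one body.
                    body = [nab] + [("ok", e2, p2) for e2, p2 in sam.int_[key]]
                    sd.append((body, head))
                for e2, p2 in sam.inp.get(key, []):
                    # Partial dependence: one clause per disjunct.
                    sd.append(([nab, ("ok", e2, p2)], head))
    return sd


# The WM fragment of the running example (Fig. 3 and Fig. 5).
wm = SAM(
    comp=["WM"],
    phi={"WM": ["np"], "WS": ["pe", "cons_OM", "eo"]},
    out={"WM": ["WS"]},
    int_={("WM", "WS", "pe"): [("OM", "pe"), ("MD", "pe")],
          ("WM", "WS", "cons_OM"): [("OM", "pe")]},
    inp={("WM", "WS", "eo"): [("OM", "eo"), ("MD", "eo")]},
)

for body, head in compute_model(wm):
    print(body, "->", head)
```

For the WM fragment this produces five clauses: one for pr_np, one each for the two total dependences, and two for the partial dependence of (WS, pr_eo). For a finite number of snapshots, each clause would be instantiated once per snapshot index.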

4 Runtime Monitoring and Fault Localization

The runtime diagnosis system consists of two modules, the diagnosis module (DM) and the observation module (OM). These modules are executed concurrently. While the DM performs runtime fault detection and localization at the logical level, the OM continuously monitors the software system and provides the abstract observations which are used by the DM. Thus, the OM can be regarded as an abstraction layer between the architecture model, as presented in Section 3, and the running software.

Algorithm 1: The algorithm for computing the logical model.
Input: The SAM.
Output: The system description SD.
COMPUTEMODEL(COMP, CONN, Φ, ϕ, out, inp, int)
(1) SD := ∅.
(2) For all c ∈ COMP:
(3)     For all pr ∈ ϕ(c): add ¬AB(c) → ok(c, pr, s) to SD.
(4)     For all e ∈ out(c), for all pr ∈ ϕ(e): add
            ¬AB(c) ∧ ⋀_{(e′, pr′) ∈ int(c, e, pr)} ok(e′, pr′, s) → ok(e, pr, s)
        and
            ¬AB(c) ∧ ⋁_{(e′, pr′) ∈ inp(c, e, pr)} ok(e′, pr′, s) → ok(e, pr, s)
        to SD.
(5) Return SD.

Let us consider the OM first. It basically consists of observers. Each observer comprises a set of rules which specify the desired behavior of a certain part of the software system. A rule is a piece of software which continuously monitors that part. The execution of the rules is concurrent and unsynchronized, and we do not impose any restrictions on the implementation of a rule and its complexity. Furthermore, while a property is assigned to a single component or connection, a rule may monitor multiple communication links in the target system in order to detect wrong sequences of events. In our example, the rules for ok(WS, pr_cons_OM, s) take the events on the connections WS and OM into account.

When a rule detects a violation of its specification, it switches from the state not violated to the state violated. To each observer, a set of atomic sentences is assigned which represent the logical observations.

Furthermore, an observer may be enabled or disabled. The rules of a disabled observer are inactive, and the observer does not provide any observations. Disabled observers may be enabled in the course of the fault localization. Note that it is often desired to initially disable those observers which otherwise would cause unnecessary runtime overhead.

Definition 4.1 (Observation Module OM) The OM is a tuple (OS, OSe), where OS is the set of all available observers and OSe ⊆ OS is the set of those observers which are currently enabled.

Definition 4.2 (Observer) An observer os ∈ OS is a tuple (R, Ω) with:

1. A set of rules R. For a rule r ∈ R, the boolean function violated(r) returns true if a violation of its specification has been detected.

2. A set of atomic sentences Ω. Each atom ω ∈ Ω has the form ok(x, pr, s), where x ∈ COMP ∪ CONN, pr ∈ ϕ(x), and s is a variable denoting an observation snapshot (see Definition 3.1).

An observer detects a misbehavior if one or more of its rules are violated. Let υ(OSe) denote the set of observers which have detected a misbehavior, i.e. υ(OSe) = {(R, Ω) ∈ OSe | ∃r ∈ R: violated(r) = true}. Then the total set of observations of a certain snapshot s_i is computed as shown in Algorithm 2.

Algorithm 2: The algorithm for computing the set of observations.
Input: The set of enabled observers and a constant denoting the current snapshot.
Output: The set OBS which comprises ground literals.
COMPUTEOBS(OSe, s_i)
(1) OBS := ∅.
(2) For all os ∈ OSe, os = (R, Ω):
(3)     If os ∈ υ(OSe): add ⋀_{ω ∈ Ω} ¬ω to OBS
(4)     else: add ⋀_{ω ∈ Ω} ω to OBS.
(5) For all atoms α ∈ OBS: substitute s_i for the variable s.
(6) Return OBS.
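A minimal Python sketch of Algorithm 2 is given below. The `Rule` and `Observer` classes and the encoding of a ground literal as a tuple `(sign, x, pr, i)` are hypothetical choices for this illustration, not part of the original system.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    # True once the rule has detected a violation of its specification.
    violated: bool = False


@dataclass
class Observer:
    rules: list  # the set R of rules
    atoms: list  # the set Omega, as (x, pr) pairs standing for ok(x, pr, s)


def has_detected(observer):
    """An observer detects a misbehavior if at least one rule is violated."""
    return any(rule.violated for rule in observer.rules)


def compute_obs(enabled, i):
    """Return the ground literals of snapshot i as (sign, x, pr, i) tuples.

    sign is True for ok(x, pr, s_i) and False for its negation.
    """
    obs = set()
    for observer in enabled:
        sign = not has_detected(observer)
        for x, pr in observer.atoms:
            obs.add((sign, x, pr, i))  # s is replaced by the snapshot index
    return obs


watch = Observer(rules=[Rule(violated=True)], atoms=[("WS", "eo")])
quiet = Observer(rules=[Rule()], atoms=[("OM", "pe")])
print(sorted(compute_obs([watch, quiet], 0)))
```

Each conjunction in lines (3) and (4) is stored as a set of individual literals, which is equivalent for the purpose of consistency checking.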

Algorithm 3 presents the algorithm which is executed by the diagnosis module DM. The inputs to the algorithm are the logical system description SD, which is returned by the computeModel algorithm (Alg. 1), and an observation module OM = (OS, OSe). In contrast to the work in [Steinbauer and Wotawa, 2005], this algorithm is able to gather additional observations by integrating runtime measurement selection.

The algorithm periodically determines whether a misbehavior is detected by an observer. In this case, it waits for a certain period of time (line 6). This gives the observers the opportunity to detect additional symptoms, as it may take some time until faults manifest themselves in the observed system behavior. Thereafter, the diagnoses are computed (line 10) using Reiter’s Hitting Set algorithm [Reiter, 1987].

Note that the violated rules are reset to not violated after computing the logical observations (line 9). Therefore, an observer which detects a misbehavior in snapshot s_j may report a correct behavior in s_{j+1}. This is necessary for the localization of multiple faults in the presence of intermittent symptoms.

When we find several diagnoses (lines 11 and 12), it is desirable to enable additional observers in OS \ OSe. We assume the function ms(SD, OBS, OS, OSe) to perform a measurement selection, i.e. it returns a set of observers OSs (OSs ⊆ OS \ OSe) whose observations could lead to a refinement of the diagnoses. We do not describe the function ms in this paper. In [de Kleer and Williams, 1987], a strategy based on Shannon entropy to determine the optimal next measurement is discussed. Note that the returned set may be empty, even if no unique diagnosis is derivable.

The fault localization is finished when either a unique diagnosis is found or the diagnoses cannot be further refined by enabling additional observers (line 11).
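The effect of enabling an additional observer can be illustrated with a small Python sketch of the diagnosis step (line 10 of Algorithm 3). It is a brute-force stand-in for Reiter’s Hitting Set algorithm, not the method used in the paper: candidate sets are enumerated by increasing size and checked for consistency by forward chaining over Horn rules. The two-component system and all names in it are hypothetical.

```python
from itertools import combinations


def consistent(rules, healthy, obs):
    """Check rules, health assumptions and observations, snapshot by snapshot."""
    for i in {lit[3] for lit in obs}:
        snapshot = [lit for lit in obs if lit[3] == i]
        facts = {("nab", c) for c in healthy}
        facts |= {("ok", x, pr) for sign, x, pr, _ in snapshot if sign}
        changed = True
        while changed:  # forward chaining up to a fixpoint
            changed = False
            for body, head in rules:
                if head not in facts and body <= facts:
                    facts.add(head)
                    changed = True
        # A derived atom that was observed as violated is a contradiction.
        if any(("ok", x, pr) in facts for sign, x, pr, _ in snapshot if not sign):
            return False
    return True


def minimal_diagnoses(rules, comp, obs):
    """Enumerate subset-minimal sets of components assumed abnormal."""
    found = []
    for k in range(len(comp) + 1):
        for delta in combinations(comp, k):
            d = set(delta)
            if any(m <= d for m in found):
                continue  # a subset is already a diagnosis
            if consistent(rules, set(comp) - d, obs):
                found.append(d)
    return found


# Hypothetical chain: component A writes connection a, component B reads a
# and writes connection b; one property p per connection.
RULES = [
    (frozenset({("nab", "A")}), ("ok", "a", "p")),
    (frozenset({("nab", "B"), ("ok", "a", "p")}), ("ok", "b", "p")),
]
COMP = ["A", "B"]

first = {(False, "b", "p", 0)}                        # only b is observed
refined = first | {(True, "a", "p", 0)}               # observer for a enabled
print(minimal_diagnoses(RULES, COMP, first))          # two candidates
print(minimal_diagnoses(RULES, COMP, refined))        # unique diagnosis
```

With only the output of B observed, both components are single-fault candidates. Once the observer for connection a reports a correct behavior, only B remains, which corresponds to the termination condition |D| = 1.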

5 Case Studies and Discussion

We implemented the proposed diagnosis system and conducted a series of experiments using the control software of a mobile autonomous soccer robot. We applied a propositional Horn clause theorem prover for consistency checks in the diagnosis engine [Minoux, 1988]. The implemented measurement selection process may enable multiple observers at the same time in order to reduce the time required for fault localization.

Algorithm 3: The runtime diagnosis algorithm.
Input: The logical system description and the observation module.
PERFORMRUNTIMEDIAGNOSIS(SD, OM)
(1) Do forever:
(2)     Query the observers, i.e. compute the set υ(OSe).
(3)     If υ(OSe) ≠ ∅:
(4)         Set i := 0, OBS := ∅, finished := false, where i is the snapshot index.
(5)         While not finished:
(6)             Wait for the symptom collection period δc.
(7)             Recompute υ(OSe).
(8)             OBS := OBS ∪ OBS_i, where OBS_i := computeOBS(OSe, s_i).
(9)             Reset all rules to not violated.
(10)            Compute D: D := {∆ | ∆ is a minimal diagnosis of (SD, COMP, OBS)}.
(11)            If |D| = 1 or the set OSs := ms(SD, OBS, OS, OSe) is empty: start repair, set finished := true.
(12)            Otherwise: set i := i + 1, enable observers in OSs, and set OSe := OSe ∪ OSs.

The components of the control system are executed in separate applications which interact with each other using CORBA communication mechanisms. The software runs on a Pentium 4 CPU with a clock rate of 2 GHz. The model of the software system comprises 13 components and 14 connections. We introduced 13 different properties. Seven different types of rules were implemented, and the observation module used 21 instances of these rule types.

For the specification of the system behavior we used simple rules which embody elementary insights into the software behavior. For example, we specified the minimum number of processes spawned by certain applications. Furthermore, we identified patterns in the communication among components. A simple insight is the fact that components of a robot control system often produce new events either periodically or as a response to a received event. Other examples are rules which express that the output of a component must change when the input changes, or specifications capturing the observation that the values of certain events must vary continuously.

We simulated software failures by killing single processes in 10 different applications and by injecting deadlocks in these applications. We investigated whether the faults can be detected and located in case the outputs of these components are observed. In 19 out of 20 experiments, the fault was detected and located in less than 3 seconds. In only one case it was not possible to detect the fault, because the specification of an important connection would have required information about the physical environment which was not available. Note that we set the symptom collection period δc to 1 second (see Alg. 3, line 6), and the fault localization incorporated no more than 2 observation snapshots.

Due to the small number of components and connections, the computation of the diagnoses required only a few milliseconds. Furthermore, the overhead (in terms of CPU load and memory usage) caused by the runtime monitoring was negligible, in particular because calls to the diagnosis engine are only necessary after an observer has detected a misbehavior.

Furthermore, we conducted 6 non-trivial case studies in order to investigate more complex scenarios. We injected deadlocks in different applications. We assumed that significant connections are either unobservable or should be observed only on demand, i.e. in the course of the fault localization, because otherwise the runtime overhead would be unacceptable. In 4 scenarios we injected single faults, while in the other cases 2 faults occurred in different components almost at the same time. Moreover, in 2 scenarios the symptoms were intermittent and disappeared during the fault localization.

In all of the 6 case studies, the faults could be correctly detected and located. In two cases, the fault was immediately detected and then located within 2 seconds. In one case the fault was detected after about 5 seconds, and the localization took 2 more seconds. However, in three case studies the simple rules detected the faults only in certain situations, e.g. when the physical environment was in a certain state. For example, in one case study the fault could be detected only when it occurred while the soccer robot was dribbling the ball. The fault was not detected in situations in which the robot did not have the ball.

We gained several insights from our experiments. In general, state-based diagnosis appears to be an appropriate approach for fault localization in a robot control system as a particular example of component-oriented software. We were able to identify simple patterns in the interaction among the components, and by using rules which embody such patterns it was possible to create appropriate models which abstract from the dynamic software behavior. Furthermore, the approach proved to be feasible in practice since the overhead caused by the runtime monitoring is low.

An important issue is how to find the properties for a specific application. Our work aims at software systems comprising large and complex components. At present, formal specifications are rarely available for such systems. Thus, it will often be necessary to manually derive the properties from informal (textual and graphical) specifications, which are often coarse and incomplete.

It would be desirable to automatically extract the properties from the source code of the software system, for example by relying on assertions (Design by Contract, [Meyer, 1997]). Unfortunately, in general this is not possible for several reasons. First, only a part of the source code of a software system may be available, especially in complex systems which often integrate third party frameworks and libraries. Second, we cannot expect that the automated extraction of properties is computationally feasible for complex systems. The granularity of properties at the component level is quite different from that of assertions at the source code level, as properties relate to the overall behavior of a component whereas assertions are assigned to functions and classes in the source code and thus define local conditions. Therefore, in order to derive a single property automatically it would, in general, be necessary to take the entire source code (including all assertions) into account, which is computationally infeasible for large systems.

A main problem is the fact that simple rules are often too coarse to express the software behavior. Such rules may detect faults only in certain situations. Therefore, it may happen that faults are either not detected or that they are detected too late, which could cause damage due to the misbehavior of the software system. The usage of simple rules also has the effect that more connections must be permanently observed than would be the case if more complex rules were used. For example, in the control system we used in our experiments we had to observe more than half of the connections permanently in order to be able to detect severe faults like deadlocks in most of the components.

6 Related Research

There is little work which deals with model-based runtime diagnosis of software systems. In [Grosclaude, 2004] an approach for model-based monitoring of component-based software systems is described. The external behavior of components is expressed by Petri nets. In contrast to our work, the fault detection relies on the alarm-raising capabilities of the components themselves and on temporal constraints.

In the area of fault localization in Web Services, the authors of [Ardissono et al., 2005] propose a modelling approach which is similar to ours. Both approaches use grey-box models of components, i.e. the dependences between the inputs and outputs of components are modelled. However, their work assumes that each message (event) on a component output can be directly related to a certain input event, i.e. each output is a response which can be related to a specific incoming request. As we cannot make this assumption, we abstract over a series of events within a certain period of time.

Another approach to model the behavior of software is presented in [Mikaelian and Williams, 2005]. In order to deal with the complexity of software, the authors propose to use probabilistic, hierarchical, constraint-based automata (PHCA). However, their work addresses software which is embedded in hardware systems, and they model the software in order to detect faults in the hardware. The authors of [Mikaelian and Williams, 2005] do not detect software bugs.

In the field of autonomic computing, there are model-based approaches which aim at the creation of self-healing and self-adaptive systems. The authors of [Garlan and Schmerl, 2002] propose to maintain architecture models at runtime for problem diagnosis and repair. Their architecture models comprise components and connectors. Their notion of a component resembles our definition. Similar to our work, they assign properties to components and connectors. The constraints over the properties are defined in a first-order language. However, their work does not employ fault localization mechanisms.

Pinpoint [Chen et al., 2002] is a framework for root-cause analysis in large distributed component applications (e.g. e-commerce systems). Pinpoint monitors client requests, uses traffic sniffing and middleware instrumentation to detect failed requests, and then applies data mining techniques to determine which components are likely to be faulty. The advantage of their approach is that it does not rely on static dependency models. No knowledge of the application components is required.

The author of [Auguston, 1998] suggests an approach to assertion checking, debugging, and profiling by building a behavioral model in terms of a number of events (so-called event traces). Moreover, the author proposes a language to describe computations over event traces and states that algorithmic debugging [Shapiro, 1983] can be considered as an example of a debugging strategy based on a specific assertion language (e.g. assertions about procedure call outcomes). Moreover, the authors of [Console et al., 1993] discuss the relationship between algorithmic debugging and MBD.

In contrast to the work presented herein, research in the area of model-based software debugging deals with verification and particularly fault localization [Mayer and Stumptner, 2003; Kob and Wotawa, 2004] at compile time. Since limitations on computational and memory resources are less stringent than in runtime diagnosis, most of this research deals with fault localization at the object, statement or expression level, whereas our model focuses on capturing component-level behavior.

Design by Contract [Meyer, 1997] is a lightweight formal technique for runtime detection of specification violations. The trace assertions approach [Brokens and Moller, 2002b; 2002a] extends the Design by Contract approach by specifying the desired behavior of a program in terms of CSP-like processes. This allows for specifying valid events in a systematic fashion, also incorporating abstraction techniques. Similar to our approach, these so-called trace assertions are checked at runtime. However, the work presented in [Brokens and Moller, 2002b; 2002a] focuses on runtime error detection in Java programs and on specification techniques for traces. Our work fits in the same context; however, we focus on fault detection as well as localization, in particular in autonomous software systems.

The authors of [Steinbauer and Wotawa, 2005] discuss the repair of component-oriented software systems at runtime. The repair is basically done by restarting failed components.

7 Conclusion and Future Research

This paper presents a model-based diagnosis approach for the detection and localization of faults in component-oriented software systems at runtime. Our model allows introducing arbitrary properties and assigning them to components and connections. The fault detection is performed by rules, i.e. pieces of software which continuously monitor the software system in order to detect property violations. The fault localization utilizes dependences between properties. We formalize the architecture of a component-oriented software system and the dependences between properties. We employ two different kinds of dependences, total dependences and partial dependences.

Moreover, we provide algorithms for the generation of the logical model and for the runtime diagnosis. The runtime fault localization integrates measurement selection by enabling additional observers at runtime.

Finally, we discuss case studies which demonstrate that our approach is frequently able to quickly detect and locate faults. We were able to create appropriate models which abstract from the dynamic behavior by relying on simple rules which embody elementary insights into the software system.


The main problem we identified is the fact that simple rules, in contrast to more complex specifications, often detect faults only in certain situations. As a consequence, it may happen that faults are either not detected or detected too late.

We plan to evaluate our approach in other application domains as well. Another open issue is whether our approach can be adapted to a distributed diagnostic engine, which would be useful for software systems which are distributed over a network. Moreover, our future research will deal with autonomous repair of software systems at runtime.

References

[Ardissono et al., 2005] Liliana Ardissono, Luca Console, Anna Goy, Giovanna Petrone, Claudia Picardi, Marino Segnan, and Daniele Theseider Dupre. Cooperative Model-Based Diagnosis of Web Services. In Proceedings of the 16th International Workshop on Principles of Diagnosis, DX Workshop Series, pages 125–132, June 2005.

[Auguston, 1998] Mikhail Auguston. Building program behavior models. In Proceedings of the European Conference on Artificial Intelligence (ECAI), Workshop on Spatial and Temporal Reasoning, pages 19–26. IOS Press, 1998.

[Brokens and Moller, 2002a] Mark Brokens and Michael Moller. Dynamic event generation for runtime checking using the JDI. In Klaus Havelund and Grigore Rosu, editors, Proceedings of the Federated Logic Conference Satellite Workshops, Runtime Verification, volume 80 of Electronic Notes in Theoretical Computer Science. Elsevier, July 2002.

[Brokens and Moller, 2002b] Mark Brokens and Michael Moller. Jassda Trace Assertions, Runtime checking the dynamic of Java programs. In Ina Schieferdecker, Hartmund Konig, and Adam Wolisz, editors, Trends in Testing Communicating Systems, International Conference on Testing of Communicating Systems, pages 39–48, March 2002.

[Brusoni et al., 1998] Vittorio Brusoni, Luca Console, Paolo Terenziani, and Daniele Theseider Dupre. A spectrum of definitions for temporal model-based diagnosis. Artificial Intelligence, 102(1):39–79, 1998.

[Chen et al., 2002] M. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint: Problem determination in large, dynamic Internet services. In Proceedings of the International Symposium on Dependable Systems and Networks (DSN), June 2002.

[Console et al., 1993] Luca Console, Gerhard Friedrich, and Daniele Theseider Dupre. Model-based diagnosis meets error diagnosis in logic programs. In Proceedings 13th International Joint Conf. on Artificial Intelligence, pages 1494–1499, Chambery, August 1993.

[de Kleer and Williams, 1987] Johan de Kleer and Brian C. Williams. Diagnosing multiple faults. Artificial Intelligence, 32(1):97–130, 1987.

[Friedrich et al., 1999] Gerhard Friedrich, Markus Stumptner, and Franz Wotawa. Model-based diagnosis of hardware designs. Artificial Intelligence, 111(2):3–39, July 1999.

[Garlan and Schmerl, 2002] David Garlan and Bradley Schmerl. Model-based adaptation for self-healing systems. In WOSS ’02: Proceedings of the first workshop on Self-healing systems, pages 27–32, New York, NY, USA, 2002. ACM Press.

[Grosclaude, 2004] Irene Grosclaude. Model-based monitoring of software components. In Proceedings of the 16th European Conference on Artificial Intelligence, pages 1025–1026. IOS Press, June 2004. Poster.

[Jackson, 1995] Daniel Jackson. Aspect: Detecting Bugs with Abstract Dependences. ACM Transactions on Software Engineering and Methodology, 4(2):109–145, April 1995.

[Kob and Wotawa, 2004] Daniel Kob and Franz Wotawa. Introducing alias information into model-based debugging. In 16th European Conference on Artificial Intelligence (ECAI), pages 833–837, Valencia, Spain, August 2004. IOS Press.

[Mayer and Stumptner, 2003] Wolfgang Mayer and Markus Stumptner. Extending diagnosis to debug programs with exceptions. In Proceedings of the 18th IEEE International Conference on Automated Software Engineering (ASE), Montreal, Quebec, Canada, October 2003. IEEE.

[Meyer, 1997] B. Meyer. Object-Oriented Software Construction. OSE Press, 2nd edition, 1997.

[Mikaelian and Williams, 2005] Tsoline Mikaelian and Brian C. Williams. Diagnosing complex systems with software-extended behavior using constraint optimization. In Proceedings of the 16th International Workshop on Principles of Diagnosis, DX Workshop Series, pages 125–132, 2005.

[Minoux, 1988] Michel Minoux. LTUR: A Simplified Linear-time Unit Resolution Algorithm for Horn Formulae and Computer Implementation. Information Processing Letters, 29:1–12, 1988.

[Patterson et al., 2002] David Patterson, Aaron Brown, Pete Broadwell, George Candea, Mike Chen, James Cutler, Patricia Enriquez, Armando Fox, Emre Kiciman, Matthew Merzbacher, David Oppenheimer, Naveen Sastry, William Tetzlaff, Jonathan Traupman, and Noah Treuhaft. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. Technical report, Berkeley, CA, USA, 2002.

[Reiter, 1987] Raymond Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57–95, 1987.

[Shapiro, 1983] Ehud Shapiro. Algorithmic Program Debugging. MIT Press, Cambridge, Massachusetts, 1983.

[Steinbauer and Wotawa, 2005] Gerald Steinbauer and Franz Wotawa. Detecting and locating faults in the control software of autonomous mobile robots. In Proceedings of the 19th International Joint Conference on AI (IJCAI-05), pages 1742–1743, Edinburgh, UK, 2005.


Abstract Dependence Models in Software Debugging

Bernhard Peischl and Safeeullah Soomro and Franz Wotawa ∗

Technische Universität Graz
Institute for Software Technology
8010 Graz, Inffeldgasse 16b/2, Austria
peischl,ssoomro,[email protected]

Abstract

In this article we introduce and formalize a novel model particularly tailored for detecting and localizing structural faults in procedural programs. Moreover, we discuss the relationship between this model and the well-known functional-dependence model, particularly in the presence of partial specification artifacts like assertions or pre- and postconditions. Furthermore, we present novel results obtained from our most recent case study. Notably, whenever our novel model detects a structural fault, it also appears to be capable of localizing the detected misbehavior’s real cause.

1 Introduction

Abstract dependences [Jackson, 1995] are applied in software analysis in various ways (e.g. software maintenance, program understanding, program slicing, refactoring, and also in software debugging). Fault localization employing abstract dependences has sound theoretical foundations [Friedrich et al., 1999], and the relationships to other techniques in software engineering, for example program slicing [Weiser, 1984], have been clarified [Wotawa, 2002]. Specifically in MBSD (model-based software debugging), we are aware of two different models relying on abstract dependences. Both models have individual strengths and weaknesses and, as outlined in this article, appear to complement each other in terms of their diagnostic capabilities.

The so-called functional-dependence model (FDM) represents program statements as components and completely abstracts from individual values, referring to variables merely as being correct or incorrect with regard to a given expected behavior. For example, this representation allows for debugging VHDL designs of up to 10 MB of source code [Friedrich et al., 1999]. The FDM captures the program’s behavior by stating that whenever a statement is correct and its inputs are known to be correct, then the output must be correct, too. For a detailed treatment of the FDM we refer to [Friedrich et al., 1999].

∗We listed authors in alphabetical order. The Austrian Science Fund (FWF) supports this work under project grant P17963-N04. The Higher Education Commission (HEC), Pakistan, supports this work under its scholarship program.

The verification-based model (VBM) for debugging is an extension of the dependence model from Jackson’s Aspect system [Jackson, 1995], which has been used for verification of C programs. The Aspect system analyzes the dependences between variables of a given program and compares them with the specified dependences. In case of a mismatch the program is said to violate the specification. Otherwise, the program fulfills the specification. Unfortunately, the Aspect system does not allow locating the source of a mismatch. The VBM extends Jackson’s idea towards not only detecting misbehavior but also localizing the malfunction’s real cause.

In this article we (1) provide a formalization of the VBM, (2) outline a novel model extension allowing for debugging of procedural programs, (3) discuss the relationship between the FDM and VBM in the presence of partial specifications like assertions by exemplifying specific scenarios in software debugging, and (4) present novel results from our most recent case study.

2 The Verification-Based Model

In the following we explain the basic ideas using a small program which computes the circumference and area of a circle. The program contains one fault in line 2, where a multiplication by π is missing.

0. // pre true
1. d = r * 2;
2. a = d; // BUG! a = d * pi;
3. c = r * r * pi;
4. // post c = r² · π ∧ a = 2 · r · π

Informally, a program variable x depends on a variable y if a value for y potentially influences the value of x. In our small example program the variable d in the first line depends on the variable r. Hence, every statement of a program introduces new dependences. All defined variables of an assignment statement, i.e., variables occurring on the left side, depend on all variables of the right side of the assignment statement.
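The rule for assignment statements can be illustrated with a few lines of Python. The sketch uses Python's own `ast` module as a stand-in parser, which is an assumption of this example: the language analyzed in this article is a procedural language, and the function name `assignment_dependences` is hypothetical.

```python
import ast


def assignment_dependences(stmt):
    """Return {(x, y)}: every defined variable x depends on every
    variable y occurring on the right side of the assignment."""
    node = ast.parse(stmt).body[0]
    if not isinstance(node, ast.Assign):
        raise ValueError("only simple assignment statements are handled")
    defined = {n.id for target in node.targets
               for n in ast.walk(target) if isinstance(n, ast.Name)}
    used = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
    return {(x, y) for x in defined for y in used}


print(assignment_dependences("d = r * 2"))       # d depends on r
print(assignment_dependences("c = r * r * pi"))  # c depends on r and pi
```

Constants such as 2 introduce no dependences, since only variable names are collected.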

Similar rules can be obtained for other statements, as explained later on. These dependences are given by a statement only if we assume that the statement is correct (w.r.t. the dependences). If a statement is assumed to be incorrect, the dependences are not known. We express the latter fact by introducing a new type of variable, the so-called model variables. Model variables are variables that work as placeholders for program variables. For example, if we assume statement 2 to be incorrect, we introduce a model that says that program variable a depends on model variable ξ2 (where ξ2 is unique).

The idea behind our approach is to find assumptions about the correctness and incorrectness of statements which do not contradict a given specification. In our running example, the specification is given in terms of a post-condition. From this post-condition we derive that a has to depend on r and pi. However, when assuming statements 1 and 2 to be correct, we derive that a depends on d and d in turn depends on r, which leads to a depending on r but not on pi. Hence, the computed dependence contradicts the specified one.

To get rid of this inconsistency, we might assume line 2 to be faulty. Hence, we can compute that a depends on model variable ξ2. When now comparing the specification with the computed dependence, we substitute ξ2 by r and pi and can no longer derive an inconsistency.

In the rest of this section we formalize the basic idea. We start with the definition of dependences.

The interpretation of a dependence (x, y) of a relation R ∈ D is that x depends on y. The dependence relations for every line of our small program are:

1. d = r * 2;        r1 = {(d, r)}
2. a = d;            r2 = {(a, d)}
3. c = r * r * pi;   r3 = {(c, r), (c, pi)}

Our novel debugging model allows for reasoning about functions over dependence relations under given assumptions. Therefore, the notion of a dependence relation is fundamental to our approach:

Definition 2.1 (Dependence Relation) Given a program with variables V and a set of model variables M = {ξ1, . . .}, a dependence relation is a subset of V × (M ∪ V), i.e., an element of the set D = 2^(V × (M ∪ V)).

For combining the dependences of two consecutive statements we define the following composition operator for dependence relations.

Definition 2.2 (Composition) Given two dependence relations R1, R2 ∈ D on V and M. The composition of R1 and R2 is defined as follows:

$$
\begin{aligned}
R_1 \bullet R_2 = {} & \{(x, y) \mid (x, z) \in R_2 \wedge (z, y) \in R_1\} \\
{} \cup {} & \{(x, y) \mid (x, y) \in R_1 \wedge \nexists\, (x, z) \in R_2\} \\
{} \cup {} & \{(x, y) \mid (x, y) \in R_2 \wedge \nexists\, (y, z) \in R_1\}
\end{aligned}
$$

This definition ensures that no information is lost while computing the overall dependence relation for a procedure or method. The first line of the definition of composition handles the case where there is a transitive dependence. The second line states that all dependences that are not re-defined in R2 are still valid. In the third line all dependences that are defined in R2 are in the new dependence set, provided that there is no transitivity relation. Note that this composition is not a commutative operation and that the empty set {} is the identity element of composition.

For example, the combined dependences for our running example are: r1 • r2 = {(a, r), (d, r)} = r′ and r′ • r3 = {(a, r), (d, r), (c, r), (c, pi)} = r′′.
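The composition operator of Definition 2.2 translates directly into set comprehensions. The following Python sketch represents a dependence relation as a set of pairs of strings, which is an encoding chosen for this illustration only, and replays the running example with r1, r2 and r3 as listed above.

```python
def compose(r1, r2):
    """Composition R1 • R2 of two dependence relations (Definition 2.2)."""
    defined_in_r1 = {x for x, _ in r1}
    defined_in_r2 = {x for x, _ in r2}
    # Line 1: transitive dependences through a variable defined in R1.
    transitive = {(x, y) for x, z in r2 for z1, y in r1 if z == z1}
    # Line 2: dependences of R1 that are not re-defined in R2.
    kept = {(x, y) for x, y in r1 if x not in defined_in_r2}
    # Line 3: dependences of R2 without a transitive continuation in R1.
    fresh = {(x, y) for x, y in r2 if y not in defined_in_r1}
    return transitive | kept | fresh


r1 = {("d", "r")}
r2 = {("a", "d")}
r3 = {("c", "r"), ("c", "pi")}

r_prime = compose(r1, r2)
r_double_prime = compose(r_prime, r3)
print(sorted(r_prime))
print(sorted(r_double_prime))
print(compose(r2, r1) == r_prime)  # composition is not commutative
```

Composing in the reverse order keeps (a, d) instead of producing (a, r), which shows why the order of statements matters.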

In order to allow the direct comparison of specified dependences with the computed ones we introduce a projection operator which deletes all dependences for variables that are not of interest, like the internal variable d.

Definition 2.3 (Projection) Given a dependence relation R ∈ D and a set of variables A ⊆ M ∪ V. The projection of R on A, written as ΠA(R), is defined as follows:

$$
\Pi_A(R) = \{(x, y) \mid (x, y) \in R \wedge x \in A\}
$$

For example, Π{a,c}(r′′) is {(a, r), (c, r), (c, pi)}, which is equivalent to the specification.

From here on we assume that the computed dependence relation is always projected onto the variables used within the specification before comparing it with the specification.

Definition 2.4 (Grounded dependence relation) A dependence relation is said to be grounded (or variable-free) if it contains no model variables.
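Projection and groundedness are equally short when a dependence relation is encoded as a set of pairs. The Python sketch below is only an illustration; the function names and the string `"xi2"` standing for the model variable ξ2 are assumptions of this example.

```python
def project(r, a):
    """Projection of R on A: keep the dependences of the variables in A."""
    return {(x, y) for x, y in r if x in a}


def is_grounded(r, model_vars):
    """A relation is grounded if no model variable occurs in it.

    Model variables can only occur as the second component, because
    dependence relations are subsets of V x (M u V).
    """
    return all(y not in model_vars for _, y in r)


r_double_prime = {("a", "r"), ("d", "r"), ("c", "r"), ("c", "pi")}
print(sorted(project(r_double_prime, {"a", "c"})))

# Relation obtained when statement 2 is assumed to be faulty.
with_model_var = {("a", "xi2"), ("c", "r"), ("c", "pi")}
print(is_grounded(r_double_prime, {"xi2"}), is_grounded(with_model_var, {"xi2"}))
```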

We assume that all specifications are grounded dependence relations. Thus, we have to compare dependence relations containing model variables with grounded dependence relations. We propose a solution similar to that employed in the resolution calculus of first-order logic, namely substitution and finding the most general unifier. However, in contrast to variable substitution in first-order logic, we do not replace one variable by one term but one model variable by a set of program variables.

Definition 2.5 (Substitution) A substitution σ is a function which maps model variables to a set of program variables, i.e., σ : M → 2^V. The result of the application of the substitution σ on a dependence relation R is a dependence relation where all model variables x in R have been replaced by σ(x).

In order to compare a computed dependence set with the specification we have to find a substitution that makes the computed dependence set equivalent to the specified one. If there is no such substitution the sets are said to be inconsistent.
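Projection and the application of a substitution can be illustrated with a small Python sketch. The names `project` and `apply_substitution`, the model variable name `xi1`, and the assumption that model variables only occur as the second component of a dependence are choices made for this illustration.

```python
from typing import Dict, Set, Tuple

Relation = Set[Tuple[str, str]]


def project(rel: Relation, keep: Set[str]) -> Relation:
    """Pi_A(R): keep only dependences whose target variable is in `keep`."""
    return {(x, y) for (x, y) in rel if x in keep}


def apply_substitution(rel: Relation, sigma: Dict[str, Set[str]]) -> Relation:
    """Replace every model variable y in a pair (x, y) by the set sigma(y)."""
    result: Relation = set()
    for x, y in rel:
        if y in sigma:
            result |= {(x, v) for v in sigma[y]}
        else:
            result.add((x, y))
    return result


r_double_prime = {("a", "r"), ("d", "r"), ("c", "r"), ("c", "pi")}
projected = project(r_double_prime, {"a", "c"})
print(sorted(projected))

# A relation with one model variable (hypothetical name "xi1").
with_model_var = {("a", "xi1"), ("c", "r"), ("c", "pi")}
grounded = apply_substitution(with_model_var, {"xi1": {"r", "pi"}})
print(sorted(grounded))
```

After applying the substitution the relation contains no model variables, so it is grounded in the sense of Definition 2.4 and can be compared with a specification directly.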

It remains to show that the approach is feasible in practice. Hence, we have to show that finding a substitution (1) is decidable and (2) can be done efficiently. Since both the set of model variables and the set of program variables (for a given program) are finite, checking all possible combinations for substitutions is possible.

The set of model variables is finite because every model variable corresponds to a program statement and the number of statements of a program is finite. The set of program variables is finite because no new variables can be generated at runtime. However, checking all possible combinations is not feasible in practice. Hence, we have to search for a more efficient procedure.

For the purpose of finding an efficient algorithm for computing a substitution that makes a dependence set equivalent to its specification we first map the problem to an equivalent constraint satisfaction problem (CSP). A CSP [Dechter, 1992; 2003] comprises variables Vars, their domains Dom, and a set of constraints Cons that have to be fulfilled when assigning values to the variables. A value assignment that fulfills all constraints is said to be a solution of the CSP. Every solution to the corresponding CSP is a valid substitution. Hence, we can make use of standard CSP algorithms for computing substitutions.

Algorithm toCSP(R, S)
Input: A dependence set R and a grounded dependence set S.
Output: A corresponding CSP (Var, Dom, Cons).

1. Every model variable x of R has a corresponding constraint variable ν_x in Var.

2. The domain Dom is equivalent to the set of all program variables V (for all constraint variables).

3. For all program variables x of V we compute a function θ which maps x to a set of program variables. θ is defined as follows: θ(x) = {y | (x, y) ∈ S ∧ (x, y) ∉ R}.

4. For all elements (x, y) of Π_A(R) with A = {x | (x, y) ∈ S} do: If y is a model variable, i.e., y ∈ M, then add a new constraint C(x,y) to Cons. The scope of C(x,y) is ν_y and the set of valid tuples comprises only one element θ(x).

Note that there is at least a single variable x for any model variable y.

Figure 1: The associated hyper-graph's structure. (Constraint nodes C labeled (x1,y1), (x2,y1), (x3,y1), (x4,y2), (x5,y2), (x6,y2), (x7,y3), (x8,y3).)

With every CSP instance (Var, Dom, Cons) we associate a hyper-graph (V, H) where V = Var and H denotes the set of constraints. The structure of the associated hyper-graph is rather simple. It simplifies to a constraint graph comprising only unconnected clusters of constraints which belong to the same model variable. Figure 1 exemplifies the associated hyper-graph's structure.

We can further improve the computation of solutions. We only have to check whether all constraints which correspond to a model variable have the same valid tuple. If this is the case, then the tuple presents a substitution for the model variable. Otherwise, there is no substitution.

Algorithm findSubstitution(R, S)
Input: A dependence set R and a grounded dependence set S.
Output: A valid substitution that makes R and S equivalent, or ⊥ if there is no such substitution.

1. Let (Vars, Dom, Cons) be toCSP(R, S).

2. For every model variable ξ do:

(a) If all valid constraints C(x,ξ) ∈ Cons have the same valid tuple M, then σ(ξ) = M.

(b) Otherwise return ⊥.

3. Return σ.

Finally, we are now able to define the equivalence of a dependence set and its grounded specification.

Definition 2.6 (Equivalence) A dependence set R is equivalent to its grounded specification S iff there exists a σ = findSubstitution(R, S) ≠ ⊥ and σ(R) = S.
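The two algorithms and Definition 2.6 can be sketched in Python as follows. This is only an illustration under stated assumptions: the CSP is represented as a dictionary from model variables to the valid tuples of their constraints, `None` stands in for ⊥, model variables are passed explicitly as a set, and a model variable without any constraint is treated as having no substitution. All function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Relation = Set[Tuple[str, str]]


def to_csp(rel: Relation, spec: Relation,
           model_vars: Set[str]) -> Dict[str, List[FrozenSet[str]]]:
    """Map each model variable to the valid tuples of its constraints C(x, y)."""
    spec_targets = {x for (x, _) in spec}

    def theta(x: str) -> FrozenSet[str]:
        # theta(x) = {y | (x, y) in S and (x, y) not in R}
        return frozenset(y for (t, y) in spec if t == x and (t, y) not in rel)

    constraints: Dict[str, List[FrozenSet[str]]] = {}
    for x, y in rel:
        if x in spec_targets and y in model_vars:
            constraints.setdefault(y, []).append(theta(x))
    return constraints


def find_substitution(rel: Relation, spec: Relation,
                      model_vars: Set[str]) -> Optional[Dict[str, Set[str]]]:
    """Return sigma, or None (standing in for bottom) if there is none."""
    constraints = to_csp(rel, spec, model_vars)
    sigma: Dict[str, Set[str]] = {}
    for xi in {y for (_, y) in rel if y in model_vars}:
        valid = set(constraints.get(xi, []))
        if len(valid) != 1:
            # Assumption of this sketch: no constraint at all is handled
            # like conflicting constraints.
            return None
        sigma[xi] = set(valid.pop())
    return sigma


def is_equivalent(rel: Relation, spec: Relation, model_vars: Set[str]) -> bool:
    """Definition 2.6: a substitution exists and sigma(R) equals S."""
    sigma = find_substitution(rel, spec, model_vars)
    if sigma is None:
        return False
    applied = {(x, v) for (x, y) in rel for v in sigma.get(y, {y})}
    return applied == spec


rel = {("a", "xi"), ("c", "r")}
spec = {("a", "r"), ("a", "pi"), ("c", "r")}
print(find_substitution(rel, spec, {"xi"}))
print(is_equivalent(rel, spec, {"xi"}))
```

Because every constraint has exactly one valid tuple, no search is needed: comparing the tuples of all constraints that share a model variable is enough.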

The following example not only shows how dependences are computed but also serves as an example of spurious dependence relations. Consider the program fragment x=y+r;x=x-r. In this program x only depends on variable y and not on r. However, the computation leads to D(x = y + r) = {(x, y), (x, r)} and D(x = x − r) = {(x, x), (x, r)}, and finally we obtain D(x = y + r; x = x − r) = {(x, y), (x, r)}, where the rightmost entry represents a spurious dependence. Hence, it is not possible to compute a minimal set of dependences. Only an approximation is possible. To reflect this fact, we employ a weaker criterion than logical equivalence. The following definition is used for checking consistency between the computed dependences and the specification.

Definition 2.7 (Contradiction, Fulfillment) Let R be the computed dependence relation for a program P, i.e., R = D(P) under a given set of assumptions A, and S a specification, i.e., a grounded set of dependences. We say that R fulfills S if there exists a substitution σ = findSubstitution(R, S) ≠ ⊥ and σ(R) ⊇ S. Otherwise, R contradicts S.

Hence, in the above example the computed dependence set (when assuming lines 1 and 2 to be correct) fulfills the specification {(x, y)} although they are not equivalent. Finding a bug is now done by finding a set of assumptions that fulfills the given specification. Model-based diagnosis algorithms, e.g., Reiter's hitting set algorithm [Reiter, 1987; Greiner et al., 1989], can be used for this purpose.

Formally, it remains to introduce how to extract dependence information from the source code. Figure 2 shows the appropriate rules. In the figure, function D returns the dependences for a given statement and function M returns the variables employed within a given statement. Moreover, function vars returns a given expression's variables. For more details about how to extract dependences we refer the reader to [Jackson, 1995]. Note that the proposed model is different from the model employed in [Jackson, 1995]. In contrast to the model proposed there, we employ a different operator for statement composition.
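As a rough illustration of the assignment rule combined with statement composition, the following Python sketch extracts the dependences of the fragment x=y+r;x=x-r and reproduces the spurious dependence on r. It is only a sketch under assumptions: Python's `ast` module stands in for a parser of the analyzed language, both statements are assumed to be correct (not abnormal), and the function names are hypothetical.

```python
import ast
from typing import Set, Tuple

Relation = Set[Tuple[str, str]]


def assignment_dependences(stmt: str) -> Relation:
    """Correct assignment: D(x = e) = {(x, v) | v in vars(e)}."""
    node = ast.parse(stmt).body[0]
    if not (isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name)):
        raise ValueError("expected a simple assignment of the form x = e")
    target = node.targets[0].id
    used = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
    return {(target, v) for v in used}


def compose(r1: Relation, r2: Relation) -> Relation:
    """Statement composition, reconstructed from its prose description."""
    defined_in_r1 = {x for (x, _) in r1}
    defined_in_r2 = {x for (x, _) in r2}
    transitive = {(x, y) for (x, z) in r2 for (z2, y) in r1 if z == z2}
    kept = {(x, y) for (x, y) in r1 if x not in defined_in_r2}
    direct = {(x, y) for (x, y) in r2 if y not in defined_in_r1}
    return transitive | kept | direct


first = assignment_dependences("x = y + r")
second = assignment_dependences("x = x - r")
combined = compose(first, second)
print(sorted(combined))

specification = {("x", "y")}
print(combined >= specification, combined == specification)
```

The combined relation is a proper superset of the specification, so it fulfills the specification in the sense of Definition 2.7 without being equivalent to it.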

In dealing with programs relying on procedural abstractions we have to (1) extend our model with rules for mapping formal parameters to actual ones, (2) clarify how to handle return values, and (3) incorporate recursive invocations.

Figure 2 outlines the rules for dealing with procedures. The first part of rule 6 states that we first compute the dependences of the procedure's body and afterwards substitute the formal parameters by the actuals. After having obtained the procedure's dependences (including actual parameters) we have to identify those actuals influencing the variables appearing in the procedure's return statements. The second part states how to establish the relationship between the return variables and the target variable of the calling context.

If we assume an invocation to be abnormal we introduce a single variable for every occurrence of a certain procedure. For recursive invocations (in all cases where we obtain a cyclic call graph) we have to perform a fixpoint analysis. In order to guarantee that the computed dependences increase monotonically w.r.t. the subset relation, we add the dependences for procedural invocation to those of the calling context (see rule 5 for procedure invocations). Thus, at the cost of over-approximating dependences, we can safely assume that there is always a fixpoint.


1. Assignments:
$\neg Ab(x = e) \rightarrow D(x = e) = \{(x, v) \mid v \in vars(e)\}$, where vars is assumed to return all variables which are used in expression e.
$M(x = e) = \{x\}$
$Ab(x = e) \rightarrow D(x = e) = \{(x, \xi_\iota)\}$

2. Conditionals:
$\neg Ab(\text{if } e \text{ then } S_1 \text{ else } S_2) \rightarrow D(\text{if } e \text{ then } S_1 \text{ else } S_2) = (D(S_1) \cup (M(S_1) \times vars(e)),\ D(S_2) \cup (M(S_2) \times vars(e)))$
$M(\text{if } e \text{ then } S_1 \text{ else } S_2) = M(S_1) \cup M(S_2)$
$Ab(\text{if } e \text{ then } S_1 \text{ else } S_2) \rightarrow D(\text{if } e \text{ then } S_1 \text{ else } S_2) = (D(S_1) \cup (M(S_1) \times \{\iota\}),\ D(S_2) \cup (M(S_2) \times \{\iota\}))$

3. Loops:
$W_i = \text{if } b \text{ then } S; W_{i-1}$
$D(W_0) = \{\}$
$D_{\neg AB}(W_i) = D(S; W_{i-1}) \cup (M(S; W_{i-1}) \times b) = D(S) \bullet D_{\neg AB}(W_{i-1}) \cup (M(S) \times b)$
$D_{AB}(W_i) = D(S; W_{i-1}) \cup (M(S; W_{i-1}) \times b) = D(S) \bullet D_{AB}(W_{i-1}) \cup (M(S) \times b)$
$\neg AB(\text{while } b \text{ do } S) \rightarrow D_{\neg AB}(\text{while } b \text{ do } S) = \bigcup_i D_{\neg AB}(W_i)$
$AB(\text{while } b \text{ do } S) \rightarrow D_{AB}(\text{while } b \text{ do } S) = \bigcup_i D_{AB}(W_i)$
Note that we introduce a single variable ι for every unfolding W_i in terms of the abnormal behavior for conditionals. D_¬AB(W_i) denotes the dependences for correct behavior of the conditional, and D_AB(W_i) denotes the dependences for the abnormal behavior.

4. No-operation (NOP):
$D(nop) = M(nop) = \{\}$

5. Sequence of statements:
$$D(S_1; S_2) = \begin{cases} D(S_1) \bullet (D(S_{then}), D(S_{else})) & S_2 = \text{if } (e) \text{ then } S_{then} \text{ else } S_{else} \\ D(S_1) \cup D(S_2) & S_2 = t \rightarrow proc(a_1, a_2, \ldots, a_n) \\ D(S_1) \bullet D(S_2) & \text{otherwise} \end{cases}$$
$M(S_1; S_2) = M(S_1) \cup M(S_2)$, where $R_1 \bullet (R_2, R_3) = R_1 \bullet R_2 \cup R_1 \bullet R_3$.

6. Procedures:
$D(proc(a_1, \ldots, a_n)) = D(body(proc(f_1, \ldots, f_n))) \bullet \{(f_i, a_i) \mid i \in 1..n\}$, where $D(body(proc(f_1, \ldots, f_n)))$ denotes the dependences of the procedure's body including the formal parameters $f_1, \ldots, f_n$.
$\neg Ab(t = proc(a_1, a_2, \ldots, a_n)) \rightarrow D(t = proc(a_1, a_2, \ldots, a_n)) = \{t\} \times \{v \mid (x, v) \in D(proc(a_1, \ldots, a_n)),\ x \in return(proc)\}$
$Ab(t = proc(a_1, a_2, \ldots, a_n)) \rightarrow D(t = proc(a_1, a_2, \ldots, a_n)) = \{(t, \xi_\iota)\}$, where t denotes the target variable and return(proc) is a function returning the return values of the procedure proc.

Figure 2: The verification-based model

We illustrate the basic definitions and the algorithms using our running example program where the area and the circumference of a circle are computed. The case where we assume that all statements work correctly was captured previously. For example, we might assume that line 1 is faulty (AB(1)):

1. d = r * 2;          r1 = {(d, ξ1)}
2. a = d;              r2 = {(a, d)}
3. c = r * r * pi;     r3 = {(c, r), (c, pi)}

The summarized dependence R1 (after projection on relevant variables, that is, after applying Π_{c,a}) is {(c, r), (c, pi), (a, ξ1)}. In projecting R1 on the set A, A refers to target variables contained in the specification. We now compare R1 with the specification S = {(c, r), (c, pi), (a, r), (a, pi)} and see that they are equivalent when using the substitution σ(ξ1) = {r, pi}. Hence, line 1 is a possible fault location.

For computing diagnoses we solve the CSP given in Section 2. In practice, we solve this CSP for every statement assumed to be erroneous, thus for a specific CSP only a single model variable is present. However, as Figure 1 suggests, this procedure can easily be extended towards searching for multiple-fault diagnoses.

In a similar fashion, we obtain AB(2) as a possible candidate. In contrast to this, AB(3) does not yield a valid substitution and thus cannot be responsible for the differences between specified and computed dependences. All other assumptions are supersets of diagnoses already obtained. Hence, we stop searching for bug locations and give back two single-fault diagnoses, i.e., AB(1) and AB(2).
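The single-fault search over the running example can be sketched in Python as follows. This is a minimal sketch under assumptions: model variables are named `xi1`, `xi2`, `xi3` and recognized by the prefix `xi`, composition is reconstructed from its prose description, and a candidate is accepted when the fulfillment criterion of Definition 2.7 holds. The encoding of the program as a list of tuples is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Relation = Set[Tuple[str, str]]

# Running example: (line, target variable, variables used in the expression).
PROGRAM = [(1, "d", {"r"}), (2, "a", {"d"}), (3, "c", {"r", "pi"})]
SPEC: Relation = {("c", "r"), ("c", "pi"), ("a", "r"), ("a", "pi")}


def compose(r1: Relation, r2: Relation) -> Relation:
    defined_in_r1 = {x for (x, _) in r1}
    defined_in_r2 = {x for (x, _) in r2}
    transitive = {(x, y) for (x, z) in r2 for (z2, y) in r1 if z == z2}
    kept = {(x, y) for (x, y) in r1 if x not in defined_in_r2}
    direct = {(x, y) for (x, y) in r2 if y not in defined_in_r1}
    return transitive | kept | direct


def dependences(abnormal_line: int) -> Relation:
    """Summarized, projected dependences with one line assumed abnormal."""
    total: Relation = set()
    for line, target, used in PROGRAM:
        if line == abnormal_line:
            stmt = {(target, f"xi{line}")}          # abnormal: model variable
        else:
            stmt = {(target, v) for v in used}      # correct assignment
        total = compose(total, stmt)
    spec_targets = {t for (t, _) in SPEC}
    return {(x, y) for (x, y) in total if x in spec_targets}


def fulfills(rel: Relation, spec: Relation) -> bool:
    """Fulfillment: a substitution exists and sigma(R) is a superset of S."""
    sigma: Dict[str, Set[str]] = {}
    for x, y in rel:
        if y.startswith("xi"):
            theta = {v for (t, v) in spec if t == x and (t, v) not in rel}
            if sigma.setdefault(y, theta) != theta:
                return False                        # conflicting valid tuples
    applied = {(x, v) for (x, y) in rel for v in sigma.get(y, {y})}
    return applied >= spec


diagnoses: List[str] = [f"AB({line})" for line, _, _ in PROGRAM
                        if fulfills(dependences(line), SPEC)]
print(diagnoses)
```

For AB(3) a tuple for the model variable can still be read off the specification, but the substituted relation lacks (a, pi) and therefore contradicts S, so line 3 is rejected as in the text.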

3 Comparing Fault Localization Models

The model comparison we present in the following relies on a couple of (reasonable) assumptions. First, for the FDM we need to have a test case judging the correctness of specific variables. In general, finding an appropriate test case revealing misbehavior w.r.t. specific variables is a difficult task; however, the presence of such a single test case is a requirement for the applicability of the FDM. For the VBM, we assume an underlying assertion language, and a mechanism for deducing dependence specifications from this language. Dependences are further oriented according to last-assigned variables and specified in terms of inputs or input parameters rather than intermediate variables. For simplicity, we further assume that there are no disjunctive post conditions.

In the following we illustrate the introduced models' strengths and weaknesses in terms of simple scenarios. In the figures the left hand side is a summary of the FDM including the observations obtained from running the test case, and the right hand side outlines the VBM. For both columns we summarize the obtained diagnosis candidates in terms of the set DIAG. Note that we only focus on single-fault diagnosis throughout the following discussion.

1 proc (a, b) ...
2 x = a + b;
3 y = a / b; // instead of y = a * b
4 assert (y == a * b)
5 ..

FDM:
¬AB(2) ∧ ok(a) ∧ ok(b) → ok(x)
¬AB(3) ∧ ok(a) ∧ ok(b) → ok(y)
→ ok(a), → ok(b), → ¬ok(y)
DIAG = {AB(3)}

VBM:
SPEC(proc) = {(y, a), (y, b)}
dep(proc) = {(y, a), (y, b)}
dep(proc) ⊇ SPEC(proc)
DIAG = {}

Figure 3: Code snippet, FD model, and specified and computed dependences.

1 proc (a, b, x, y) ...
2 x = a + b;
3 x = x + 2; // instead of y = x + 2
4 assert (y == x + 2, x == a + b)
5 ..

FDM:
¬AB(2) ∧ ok(a) ∧ ok(b) → ok(x′)
¬AB(3) ∧ ok(x′) → ok(x′′)
¬ok(x), ok(a), ok(b)
→ ¬ok(x′′), → ok(a), → ok(b)
DIAG = {AB(2), AB(3)}

VBM:
SPEC = {(y, a), (y, b), (x, a), (x, b)}
dep(proc) = {(x, a), (x, b)}
dep(proc) ⊉ SPEC(proc)
σ(ξ2) = {}, σ(ξ3) = {}
DIAG = {}

Figure 4: The misplaced left-hand side variable.

Figure 3 outlines a code snippet together with the assertion checking a certain property, the FDM, and the specified and computed dependences. Obviously, the VBM is unable to detect and thus localize this specific (functional) fault. In contrast to this, the FDM is able to localize this specific fault. Due to the failed assertion we can conclude that there is something wrong with variable y, thus ¬ok(y) holds. We can also assume that inputs a and b are correct, thus the assumptions ok(a) and ok(b) directly deliver line 3 (AB(3)) as the sole single-fault diagnosis.
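The consistency reasoning behind the FDM diagnosis for Figure 3 can be sketched in Python by forward chaining over the Horn clauses. This is an illustrative sketch, not the theorem prover used by the authors; the clause encoding and the function names are assumptions.

```python
from typing import Dict, List, Set, Tuple

# Clause for a statement: "not AB(line) and ok(premises) -> ok(conclusion)".
Clauses = Dict[int, Tuple[Set[str], str]]

FIG3_CLAUSES: Clauses = {2: ({"a", "b"}, "x"), 3: ({"a", "b"}, "y")}
FIG3_OK = {"a", "b"}        # -> ok(a), -> ok(b)
FIG3_NOT_OK = {"y"}         # -> not ok(y), from the failed assertion


def consistent(abnormal: Set[int], clauses: Clauses,
               ok_observed: Set[str], not_ok_observed: Set[str]) -> bool:
    """Derive all ok(.) facts; a derived ok(v) with observed not ok(v) is a contradiction."""
    ok = set(ok_observed)
    changed = True
    while changed:
        changed = False
        for line, (premises, conclusion) in clauses.items():
            if line not in abnormal and premises <= ok and conclusion not in ok:
                ok.add(conclusion)
                changed = True
    return not (ok & not_ok_observed)


def single_fault_diagnoses(clauses: Clauses, ok_observed: Set[str],
                           not_ok_observed: Set[str]) -> List[int]:
    """Lines whose abnormality alone removes the contradiction."""
    return [line for line in sorted(clauses)
            if consistent({line}, clauses, ok_observed, not_ok_observed)]


print(consistent(set(), FIG3_CLAUSES, FIG3_OK, FIG3_NOT_OK))
print(single_fault_diagnoses(FIG3_CLAUSES, FIG3_OK, FIG3_NOT_OK))
```

With no statement assumed abnormal the theory is inconsistent, which corresponds to detecting the fault; retracting the correctness assumption of line 3 alone restores consistency.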

Moreover, as Figure 4 illustrates, although the VBM allows for detecting misplaced left-hand side variables, the VBM cannot localize this kind of fault. Assume that a = 1, b = 1, x = 2, thus y = 4. Our assertion suggests assuming the dependences {(y, a), (y, b), (x, a), (x, b)}. Obviously, both models allow for detecting the fault. When employing the FDM, from the raised assertion we know that ¬ok(x) holds. In order to conclude that the outcome of statement 3 is correct, we need to know that x is correct prior to this statement's execution. Thus, to obtain the contradiction we have to assume that both statements are correct.

By reverting the correctness assumption about statement 2 we obviously can remove the contradiction. Moreover, reverting the assumption about statement 3 also resolves the contradiction. Thus, we obtain two single-fault diagnoses, AB(2) and AB(3). In contrast to this, since y never appears as target variable, we cannot obtain dependences for variable y and thus the VBM cannot localize this kind of (structural) fault.

The next example points out that the VBM fails in case the fault introduces additional dependences. In Figure 5 we assign x + c + d instead of x + c to the variable y. Our assertion indicates that y depends upon x and c, thus SPEC(proc) = {(y, a), (y, b), (y, c)}. Computing the program's actual dependences dep(proc), however, yields {(y, a), (y, b), (y, c), (y, d)} ⊇ {(y, a), (y, b), (y, c)}, and thus the VBM can neither detect this specific malfunctioning nor locate the misbehavior's cause. By employing the FDM under the assumption ¬ok(y) we obtain two single-fault diagnoses, AB(2) and AB(3).

1 proc (a, b, c, d) ...
2 x = a + b;
3 y = x + c + d; // instead of y = x + c
4 assert (y == x + c)
5 ..

FDM:
¬AB(2) ∧ ok(a) ∧ ok(b) → ok(x)
¬AB(3) ∧ ok(x) ∧ ok(c) ∧ ok(d) → ok(y)
→ ¬ok(y), → ok(a), → ok(b)
DIAG = {AB(2), AB(3)}

VBM:
SPEC(proc) = {(y, a), (y, b), (y, c)}
dep(proc) = {(y, a), (y, b), (y, c), (x, a), (x, b)}
dep(proc) ⊇ SPEC(proc)
DIAG = {}

Figure 5: A typical (structural) fault inducing additional dependences.

Stumptner [Stumptner, 2001] shows that localizing structural faults requires exploiting design information like assertions, and pre- and post conditions. Again, we outline this in terms of a few small examples. Although the previous examples show that the VBM can neither detect nor locate certain types of faults, it may provide reasonable results in capturing structural faults.

Figure 6 illustrates an example where the fault manifests itself in inducing fewer dependences than specified. Our specification is SPEC(proc) = {(y, a), (y, b), (y, c)}. Obviously, the computed dependences {(y, a), (y, b)} ⊉ SPEC(proc). As the figure outlines, we obtain two single-fault diagnosis candidates, AB(2) and AB(3). In this case, the FDM is also capable of delivering the misbehavior's real cause: it returns two single-fault diagnosis candidates, AB(2) and AB(3).

1 proc (a, b, c) ...
2 x = a + b;
3 y = x; // instead of y = x + c
4 assert (y == a + b + c)
5 ..

FDM:
¬AB(2) ∧ ok(a) ∧ ok(b) → ok(x)
¬AB(3) ∧ ok(x) → ok(y)
→ ¬ok(y), → ok(a), → ok(b)
DIAG = {AB(2), AB(3)}

VBM:
SPEC(proc) = {(y, a), (y, b), (y, c)}
dep(proc) = {(y, a), (y, b)}
dep(proc) ⊉ SPEC(proc)
σ(ξ2) = {a, b, c}, σ(ξ3) = {a, b, c}
DIAG = {AB(2), AB(3)}

Figure 6: A typical (structural) fault inducing fewer dependences than specified.


Our final example in Figure 8 illustrates that both approaches might deliver reasonable but different results. We assume a = 1, b = 1, e = 0, thus we expect z = 2 and d = 0. However, due to the introduced fault, we obtain z = 1 and d = 0. Since the value of z is incorrect, but d = 0, we conclude that ¬ok(z) and ok(d) hold. Thus, we obtain AB(2) and AB(4) as diagnosis candidates. Note that this result is primarily rooted in the coincidental correctness of variable d.

Given the assertion in Figure 8 we are aware of the dependences {(d, a), (d, b), (d, e), (z, a), (z, b), (z, e)}. As the figure outlines, we obtain two single-fault diagnoses, AB(2) and AB(3). As is also indicated in the figure, when solely employing a single assertion requiring z == c + d, we obtain SPEC′(proc) = {(z, a), (z, b), (z, e)} and dep′(proc) ⊉ SPEC′(proc). Consequently, we obtain 3 diagnoses in this case. However, even when employing the FDM we cannot exclude a single statement; thus, in this specific case, both models deliver the same accuracy.

The examples outlined above should have made clear that a comparison of both models in terms of their diagnostic capabilities inherently depends on how we deduce observations from violated properties. Note that the FDM itself cannot detect any faults; rather, faults are detected by evaluation of the assertions on the values obtained from a concrete test run.

The VBM can reliably detect and localize faults that manifest in missing dependences on the right-hand side of an assignment statement. Due to the over-approximation of dependences and the definition of the fulfillment criterion (see Definition 2.7) we cannot locate faults manifesting in additional dependences, as it is impossible to distinguish whether (1) the specification is incomplete, (2) the model computes spurious dependences, or (3) an unwanted dependence is present due to a fault.
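The asymmetry between missing and additional dependences follows directly from the superset test and can be made concrete with a few lines of Python. The dependence sets are taken from the discussion of Figures 5 and 6; the function name `vbm_detects` is an assumption of this sketch.

```python
from typing import Set, Tuple

Relation = Set[Tuple[str, str]]


def vbm_detects(dep: Relation, spec: Relation) -> bool:
    """A fault is detected when dep is not a superset of spec."""
    return not dep >= spec


SPEC: Relation = {("y", "a"), ("y", "b"), ("y", "c")}
DEP_ADDITIONAL: Relation = SPEC | {("y", "d")}        # y = x + c + d
DEP_MISSING: Relation = {("y", "a"), ("y", "b")}      # y = x

print(vbm_detects(DEP_ADDITIONAL, SPEC))
print(vbm_detects(DEP_MISSING, SPEC))
print(sorted(SPEC - DEP_MISSING))
```

The additional pair (y, d) passes the check unnoticed, whereas the missing pair (y, c) is reported as a violated specified dependence.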

Table 1 summarizes the illustrated examples by listing the individual models' fault detection and localization capabilities. For those examples where both models deliver diagnosis candidates, we checked whether the diagnoses provided by the VBM are a subset of those provided by the FDM.

example   FDM           VBM           diags(FDM) ⊇
          det.   loc.   det.   loc.   diags(VBM)
Fig. 3    √      √      ×      ×      -
Fig. 4    √      ×      √      ×      -
Fig. 5    ×      ×      √      √      -
Fig. 6    √      √      √      √      √
Fig. 8    √      √      √      √      ×

Table 1: Summary on the outlined scenarios.

In order to compare different models of programs for fault detection and localization, we first introduce the debugging problem formally. Similar to Reiter's definition of a diagnosis problem [Reiter, 1987], a debugging problem is characterized by the given program and its expected behavior. In contrast to Reiter we assume the existence of a specification that captures the whole expected behavior and not only behavioral instances like those given by the set of observations OBS in Reiter's original definition.

Definition 3.1 (Debugging problem) A debugging problem is characterized by a tuple (Π, SPEC) where Π is a program written in a certain programming language and SPEC is a (formal) specification of the program's intended behavior. The debugging problem now can be separated into three parts:

Figure 7: The (open) relationship between VBM and FDM. (Lattice nodes: DIAG = VBM(p), DIAG = FDM(p), DIAG = {}, DIAG = {x | x is in stmnt(p)}.)

1. Fault detection: Answer the question: Does Π fulfill SPEC? In case a program fulfills (does not fulfill) its specification we write Π ∪ SPEC ⊭ ⊥ (Π ∪ SPEC ⊨ ⊥ respectively).

2. Fault localization: Find the root cause in Π which explains a behavior not given in SPEC.

3. Fault correction: Change the program such that Π fulfills SPEC.

Note that SPEC is not required to be a formal specification. It might represent an oracle, i.e., a human, which is able to give an answer to all questions regarding program Π. In this paper we focus on the first two tasks of the debugging problem. Because fault localization and correction can only be performed when identifying a faulty behavior, from here on we assume only situations where (Π, SPEC) ⊨ ⊥. The question now is how such situations can be detected in practice.

The availability of a specification that is able to answer all questions is an assumption which is hard (not to say impossible) to fulfill. What we have in practice is a partial specification. Therefore, we are only able to detect a faulty behavior and not to prove correctness. Obviously, different kinds of specifications may lead to different results for the first task of the debugging problem, i.e., identifying a faulty behavior. In the context of this article the question about the satisfiability of Π ∪ SPEC ⊨ ⊥ is reduced to checking the satisfiability of two sentences, i.e., FDM(Π) ∪ SPEC_FDM ⊨ ⊥ and VBM(Π) ∪ SPEC_VBM ⊨ ⊥, where SPEC_FDM and SPEC_VBM are the partial specifications which belong to the FDM and VBM respectively.

In comparing both models, we start by contrasting the well-known artifacts in the area of MBSD. Table 2 summarizes the most notable differences in employing the VBM and FDM for fault localization. In both models we employ a partial specification (e.g. test case, assertion, invariant) for deducing a number of observations. Whereas the VBM encodes observations in terms of dependence relations, the FDM relies on a program's execution and subsequent classification of the observed variables. Variables are merely classified as being correct or incorrect with respect to a given (partial) specification.


artifact            VBM                                           FDM
observations        dependence relations                          ok, ¬(ok)
system descr.       functions over dependence relations VBM(Π)    Horn clauses FDM(Π)
fault detect.       VBM(Π) ⊉ SPEC                                 FDM(Π) ∪ SPEC = ⊥
fault localiz.      VBM(Π) ⊇ SPEC                                 FDM(Π) ∪ SPEC ≠ ⊥
assumptions         variable substitution ξ = ...                 ¬AB
theorem prover      CSP solver                                    Horn clause theorem prover
structural faults   detect., localiz.                             detect., localiz.
functional faults   no detect., no localiz.                       detect., localiz.

Table 2: Comparing the most common artifacts.

Furthermore, the VBM models the program in terms of functions over dependence relations, whereas the FDM captures the program's behavior by a number of logical sentences; in particular we employ a Horn clause theory. The VBM detects a fault by checking whether the system description fulfills the given specification according to the criterion given in Definition 2.7. In case this relationship does not hold, a fault has been detected. In contrast, we detect a fault with the FDM if the system description together with the specification yields a logical contradiction.

The VBM locates possible causes for detected misbehavior by assuming that specific statements depend on model variables, and checking whether there is a valid substitution for fulfillment (see Definition 2.7). As outlined in Section 2, this process is efficiently done by solving a CSP. Instead, the FDM employs a Horn clause theorem prover under the assumption of statement abnormality in computing diagnosis candidates. Note that whereas the FDM does not assume any faulty behavior for specific statements, the VBM assumes specific dependences guided by the specification.

As indicated by the examples above, the VBM is tailored towards detection and localization of structural faults, whereas the FDM may capture structural but particularly functional faults. Similar to static slicing capturing control as well as data flow dependences, the FDM must comprise all statements responsible for the computation of an erroneous variable. Thus, the FDM always provides diagnosis candidates under presence of an erroneous variable. The author of [Wotawa, 2002] points out that the FDM delivers at least the same results as static slicing. Moreover, we know that the misbehavior's real cause is always among the delivered diagnosis candidates when employing the FDM. This perspective is supported by theoretical foundation [Friedrich et al., 1999] as well as practical evidence in numerous case studies.

Particularly, a comparison w.r.t. the accuracy and completeness of the obtained diagnoses is of interest. Figure 7 summarizes the relationship of the FDM and the VBM regarding their abilities of checking satisfiability. The lines between the nodes building up the lattice denote a subset relationship. As illustrated by the examples, there are debugging problems where the VBM allows for finding a discrepancy but the FDM does not and vice versa.

1 proc (a, b, e) ...
2 c = a; // should be c = a + b
3 d = c * e;
4 z = c + d
5 assert (z == c + d, [d == c * e])
6 ..

FDM:
¬AB(2) ∧ ok(a) → ok(c)
¬AB(3) ∧ ok(c) ∧ ok(e) → ok(d)
¬AB(4) ∧ ok(c) ∧ ok(d) → ok(z)
[→ ok(d)], → ¬ok(z)
DIAG = {AB(2), AB(4)}
DIAG′ = {AB(2), AB(3), AB(4)}

VBM:
SPEC(proc) = {(z, a), (z, b), (z, e), (d, a), (d, b), (d, e)}
dep(proc) = {(z, a), (z, e), (d, a), (d, e)}
dep(proc) ⊉ SPEC(proc)
σ(ξ2) = {a, b}, σ(ξ3) = {a, b, e}, σ(ξ4) = {}
DIAG = {AB(2), AB(3)}
SPEC′(proc) = {(z, a), (z, b), (z, e)}
dep′(proc) = {(z, a), (z, e)}
dep′(proc) ⊉ SPEC′(proc)
σ(ξ′2) = {a, b}, σ(ξ′3) = {a, b, e}, σ(ξ′4) = {a, b, e}
DIAG′ = {AB(2), AB(3), AB(4)}

Figure 8: A degenerated example (error masking), diags(FDM) ⊉ diags(VBM).

4 Case Studies

In [Peischl et al., 2006] we present first experimental results indicating our approach's applicability. The results presented there solely stem from programs without procedures. In the following we extend these results with results obtained from programs comprising procedures. In evaluating the model's fault localization capabilities under presence of procedural abstraction, we decompose a program into several procedures in a step by step fashion. This procedure allows for a first evaluation of the model for both (1) parameter passing and (2) handling of return values.

Table 3 summarizes our most recent results. Specifically, the program eval evaluates the arithmetic expression z ← (r × h) + (c/d) − (d + h) × (e + f). The specification says that the left-hand side z depends on the variables r, h, c, d, e, and f. We introduced a single structural fault and decomposed this program by adding procedures computing specific subexpressions in a step by step fashion. A specific subexpression is thus evaluated by a single procedure and replaced by the variable capturing this procedure's evaluation. We refer to the decomposed programs comprising i methods by eval(i). In the remaining programs, which perform simple computations like taxes or evaluate simple arithmetic expressions, we also introduced a single structural fault.

Removing certain dependences from the specification allows for evaluating our model's capabilities in localizing structural faults under presence of partial knowledge of the dependences of the output variables. Thus, we observed a subset of the output dependences involving up to 5 variables and recorded the minimum and maximum number of diagnosis candidates.


method        no. LOC   total      min, max no. diagnosis candidates
                        dep. no.   5      4      3      2      1
eval(1)       10        9          -      -      -      -      4
eval(2)       14        10         -      -      -      4      4-11
eval(3)       18        11         -      -      4      4-13   4-18
eval(4)       22        12         -      4      4-15   4-22   4-22
eval(5)       26        13         4      4-17   4-26   4-26   4-26
sum           22        11         -      -      4      4-13   4-18
arithmetics   26        12         -      4      4-15   4-15   4-22
tax comp.     30        13         4      4-17   4-26   4-26   4-26
calculator    40        12         1-31   1-31   1-33   1-34   1-34

Table 3: Number of single-fault diagnosis candidates with decreasing number of specified output variables.

For example, regarding the program eval(3) we obtained 4 diagnosis candidates when observing all outputs. Afterwards we selected 2 output variables out of the 3 output variables, and for all possible combinations of selecting 2 out of 3 outputs, we recorded the number of diagnoses. The table specifies the minimal and maximal number of diagnosis candidates obtained in this way (in this specific case of considering 2 output variables we obtain at least 4 and at most 13 diagnosis candidates). We checked whether or not the introduced faults appear among the delivered diagnosis candidates. In all our experiments, we have been able to locate the misbehavior's real cause.

Furthermore, the table lists the number of total dependences (column 3) and the program's size in terms of the lines of code (column 2). Our experiments indicate an increase in the number of candidates with a decreasing number of outputs being considered. In the table, we did not take into account cases where the reduced output dependences are not capable of detecting the fault. In this case our approach obviously returns {}. In summary, the obtained results confirm the findings in [Hamscher and Davis, 1984]: As our problem becomes under-constrained by removing certain output dependences, the number of diagnosis candidates may increase drastically. As our experiments indicate, this also appears to hold for the novel model introduced herein.

5 Conclusion and Future Research

In this article we extended and formalized the so-called verification-based model [Peischl et al., 2006], specifically tailored towards detecting and localizing structural faults. We discussed the relationship between this model and the well-known functional dependence model [Friedrich et al., 1999] by exemplifying the weaknesses and strengths of both models.

Our examples show that there are debugging problems where the verification-based model delivers different diagnoses than the functional dependence model and vice versa. Furthermore, we presented case studies we conducted recently. Notably, whenever our novel model detects a structural fault, it also appears to be capable of localizing the misbehavior's real cause.

A future research challenge is the empirical evaluation of the modeling approaches discussed herein. Most notably, this addresses issues such as the evaluation of the proposed operator for the compound statement as well as the criteria for relating the (conservatively approximated) program dependences to the specified ones.

References

[Dechter, 1992] Rina Dechter. Encyclopedia of Artificial Intelligence, chapter Constraint Networks, pages 276–285. John Wiley & Sons, 2nd edition, 1992.

[Dechter, 2003] Rina Dechter. Constraint Processing. Morgan Kaufmann, 2003.

[Friedrich et al., 1999] Gerhard Friedrich, Markus Stumptner, and Franz Wotawa. Model-based diagnosis of hardware designs. Artificial Intelligence, 111(2):3–39, July 1999.

[Greiner et al., 1989] Russell Greiner, Barbara A. Smith, and Ralph W. Wilkerson. A correction to the algorithm in Reiter's theory of diagnosis. Artificial Intelligence, 41(1):79–88, 1989.

[Hamscher and Davis, 1984] Walter C. Hamscher and Randall Davis. Diagnosing circuits with state - an inherently underconstrained problem. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pages 276–282. Morgan Kaufmann, 1984.

[Jackson, 1995] Daniel Jackson. Aspect: Detecting Bugs with Abstract Dependences. ACM Transactions on Software Engineering and Methodology, 4(2):109–145, April 1995.

[Peischl et al., 2006] Bernhard Peischl, Safeeullah Soomro, and Franz Wotawa. Towards lightweight fault localization in procedural programs. To appear in Proceedings of the 19th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA/AIE 2006), Lecture Notes in Artificial Intelligence (LNAI). Springer Verlag, 2006.

[Reiter, 1987] Raymond Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57–95, 1987.

[Stumptner, 2001] Markus Stumptner. Using design information to identify structural software faults. In AI '01: Proceedings of the 14th Australian Joint Conference on Artificial Intelligence, pages 473–486, London, UK, 2001. Springer-Verlag.

[Weiser, 1984] Mark Weiser. Program slicing. IEEE Transactions on Software Engineering, 10(4):352–357, July 1984.

[Wotawa, 2002] Franz Wotawa. On the Relationship between Model-Based Debugging and Program Slicing. Artificial Intelligence, 135(1–2):124–143, 2002.


A Bayesian Approach to Fault Isolation with Application to Diesel Engine Diagnosis

Anna Pernestål and Mattias Nyberg
Scania CV AB
Sweden
anna.pernestal, [email protected]

Bo Wahlberg
KTH Signals, Sensors and Systems
[email protected]

Abstract

This paper considers a Bayesian approach to fault isolation. Given a set of measurements from the system, and a set of possible faults, the task is to calculate the probability that the faults are present. This probability can then be used to rank the faults, or for decisions on fault accommodation. The method requires the conditional probability distribution describing how the measurements react to the faults. In particular, the structure of dependencies between the tests is important. Knowing the structure facilitates efficient computation methods and makes it possible to reduce the memory capacity needed. In this paper, the structure is estimated from training data using Bayesian methods. The method is applied to diagnosis of the gas flow in a diesel engine.

1 Introduction

Fault isolation concerns the problem of localizing faults in technical processes. This is an important problem in all fields of industrial systems. Our motivating application is on-board fault isolation for diesel engines, where maintenance and repair procedures together with new emission regulations put challenging demands on the corresponding diagnosis systems. Other challenges are noise and model errors, which introduce uncertainty into the diagnosis process, and the limited storage capacity in the on-board control unit where the isolation system should be implemented. Further, industrial systems are often large and complex, and it is impossible to build a complete model of the whole system that is executable in the on-board control unit.

The diagnosis system is structured as in Figure 1. This architecture is commonly used for diagnosis in the FDI community, for example when utilizing structured residuals, see [Gertler, 1998]. Further, it is one of the architectures used in industrial applications, and with this motivation we will use it in the present work.

The process to be diagnosed is assumed to consist of a set of components, which can be faulty or non-faulty. The components are monitored by precompiled diagnostic tests. An example of a diagnostic test is a thresholded residual. In the isolation system, the outputs from the tests are used to make inference about which faults may be present.

[Figure 1: diagram with the blocks Process (components $c_1$, $c_2$, ..., $c_n$), Test 1, Test 2, ..., Test m, and Isolation system, with the output Diagnoses.]

Figure 1: An example of relations between components $c_i$, tests and the isolation system.

In this work, the diagnostic tests are assumed to be given, and we will focus on the isolation system in Figure 1. The isolation system computes diagnoses, i.e. the combinations of faults that can explain the outputs from the tests. With the test results as our observations, this is the same definition of diagnoses as in [de Kleer et al., 1992]. One problem is that already for small processes there can be many diagnoses, and hence a main requirement on the isolation system is that the diagnoses should be ranked by how probable they are. A second requirement is set by the limited processor and memory storage capacity in the on-board control unit.

In this work, a Bayesian approach is used for fault isolation. Given a set of test results, and a set of possible faults, the probabilities that different faults are present are computed. These probabilities are called posterior probabilities and can be used to rank the faults, or for decision making about fault accommodation. In order to compute the posterior, the conditional probability describing how the tests react to the faults is needed. In particular, the parameters and the structure of dependencies in the conditional distribution are needed. A complex structure, allowing many dependencies, will increase the storage capacity needed. On the other hand, a too simple structure will degrade the performance of the isolation.

There are two key contributions in this work. The first is that the structure of the conditional probability is used as a design variable in the construction of the isolation system. The second is that Bayesian methods are used for estimation of the structure and the parameters of the conditional probability distribution from training data. Here, training data is the outputs from the diagnostic tests under different working conditions.

The main advantage of estimating the structure from training data is that no explicit knowledge about the process is needed. This is an advantage since in many industrial applications the system to be diagnosed is large and complex, and it is impossible to build a complete model of the whole system.

When the structure of dependencies and the parameters of the conditional probability are known, a Bayesian network can be set up and computationally efficient methods for probabilistic inference can be used, see for example [Lerner, 2002], [Lu and Przytula, 2005] or [Jensen, 2001].

2 Related work

When diagnosing complex systems, model errors, noise, and disturbances introduce uncertainties in the diagnosis computation. Several methods that handle the uncertainty have been proposed in the literature. In [Colin N. Jones and Lawrence, 2002] the PGDE (Probabilistic General Diagnostic Engine) algorithm is presented. In the PGDE the logic reasoning used in [Reiter, 1992] is combined with a measure of the belief in different diagnoses. In [Touaf and Ploix, 2004] the problem of uncertainty is solved using fuzzy logic methods. In [Pulido et al., 2005] several isolation algorithms are combined, and the resulting algorithm is applied to uncertain models. In the present work, probabilistic reasoning is used to compute the probability for different diagnoses.

Other probabilistic methods can be found in the literature. The Sherlock algorithm [de Kleer and Williams, 1992], as well as its precursor the GDE (General Diagnostic Engine) [de Kleer and Williams, 1987], contains a part that corresponds to our isolation system and where probabilities for the diagnoses are computed. Those algorithms are designed for systems without noise, and the conditional probability distributions are assigned constant values, depending only on whether the measurement from the system is consistent with a fault or not. In the present work, training data is used to estimate the conditional probability distributions. If there is no training data available for the estimation of the underlying probability distributions, our algorithm is basically the same as in Sherlock and GDE.

In [Lerner et al., 2000], [Schwall and Gerdes, 2002] and [Lu and Przytula, 2005] probabilistic reasoning for isolation is successfully used on noisy systems, utilizing Bayesian networks. In these three works knowledge about the process to be diagnosed is required to set up the structure for the probabilistic reasoning. In [Lerner et al., 2000] the model of the system is translated into a Temporal Causal Graph and the structure of a Bayesian network is learned from it. In [Schwall and Gerdes, 2002] the structure of the model is given as input to the design of the isolation. In [Lu and Przytula, 2005] a known structure is used, and the focus is on effective methods for solving the inference.

3 Problem Formulation

This work consists of two separate problems. The first is how to estimate the structure of dependencies and relations between the tests and the components, given a set of training data. This part is referred to as the structure problem. The second problem is to utilize the structure to compute the diagnoses when test results arrive at the isolation system, and is called the isolation problem. The structure problem requires tedious computations, but can be solved once and off-line. The isolation problem is solved on-line, where the computational and storage capacity is limited.

The process to be diagnosed consists of a set of $N_C$ components, which can be faulty or not faulty. Enumerate the components, and let $c_i$ be a variable with domain $\{NF, F\}$, where $F$ means that the $i$:th component is faulty and $NF$ that it is not faulty. The variable $c_i$ is called the behavioral mode of the $i$:th component [de Kleer et al., 1992].

Assume that there exist $N_D$ diagnostic tests. The diagnostic tests can be discrete or continuous. Here we will assume that the tests are discrete, because this is actually the case in our application, and because it simplifies the presentation. Note however that this is not necessary for the methods in general.

The diagnostic tests are assumed to be given, but we require no knowledge about their explicit construction. The only information needed is which faults can possibly affect each test. If a test is affected by a certain fault, it is said to be able to detect the fault, but due to model errors and noise it does not necessarily detect it. Enumerate the tests and let $d_i$ denote the test result from test $i$. For example, for a binary test, the test result can be either 1, indicating that a fault is detected, or 0 if no fault is detected.

The prior knowledge about the relations between tests and components can be presented as an isolation structure, where an X at position $(i, j)$ means that test $i$ can react to a fault in the $j$:th component, but it does not necessarily react every time the fault is present. A 0 at position $(i, j)$ means that test $i$ and component $j$ are not related. For example, with three components and three tests, the isolation structure can look like

\[
\begin{array}{c|ccc}
    & c_1 & c_2 & c_3 \\ \hline
d_1 & X   & 0   & X   \\
d_2 & X   & X   & 0   \\
d_3 & 0   & X   & X
\end{array}
\tag{1}
\]

Let $C = [c_1, \ldots, c_{N_C}]$ be an assignment of behavioral modes to all components in the system, and let $D_t = [d_1, \ldots, d_{N_D}]$ be the test results at time $t$. We call $C$ the system behavioral mode. The isolation problem is to compute the probability for different assignments of behavioral modes to all components in the system, given the test results at a certain time $t$,

\[ P(C \mid D_t), \tag{2} \]

also referred to as the posterior probability. The probability (2), as well as all other probabilities in the following, should also be conditioned on the prior knowledge about the process, but for notational convenience we leave this out unless it is especially important. In the following we consider only the test results from a certain time, and we will suppress the index $t$ to simplify the notation.

To compute (2), use Bayes' rule,

\[ P(C \mid D) = \frac{P(D \mid C)\,P(C)}{P(D)}, \tag{3} \]

where $P(C)$ is the prior probability for the system behavioral mode. These priors are assumed to be known. In real systems this represents the knowledge of the quality of the components. In (3) the denominator $P(D)$ is a normalization factor, which can be computed using marginalization over all possible system behavioral modes,

\[ P(D) = \sum_{C} P(D \mid C)\,P(C). \tag{4} \]

The probability distribution $P(D \mid C)$ is called the likelihood for $C$, and it will now be shown how it can be estimated from data.
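As an illustration of (3) and (4), the following minimal Python sketch computes the posterior over system behavioral modes from a likelihood table and a prior. The function name `posterior`, the dictionary layout, and all numbers are assumptions made for this example only; they are not taken from the application described in this paper.

```python
def posterior(likelihood, prior, observed):
    """Return P(C | D = observed) for every system behavioral mode C.

    likelihood[C][D] holds P(D | C) and prior[C] holds P(C).
    """
    # Numerator of Bayes' rule (3) for each mode.
    joint = {c: likelihood[c].get(observed, 0.0) * prior[c] for c in prior}
    # Normalization factor P(D), obtained by the marginalization (4).
    evidence = sum(joint.values())
    if evidence == 0.0:
        raise ValueError("observation has zero probability under every mode")
    return {c: joint[c] / evidence for c in joint}


# Hypothetical toy numbers: one component and one binary test.
prior = {"NF": 0.9, "F": 0.1}
likelihood = {
    "NF": {0: 0.95, 1: 0.05},
    "F": {0: 0.30, 1: 0.70},
}
post = posterior(likelihood, prior, observed=1)
```

With these made-up numbers, a reacting test raises the probability of the faulty mode well above its prior of 0.1, which is the ranking behavior the isolation system relies on.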

For the on-line isolation, the likelihood is stored as a table, and when test results arrive at the isolation system, the probabilities for different values of $C$ given the data $D$ are computed using (3). The table can be very large; for example, assuming binary tests the number of elements needed for storage is $2^{N_C+N_D}$. Even for a small process containing only ten tests and ten components this table has more than a million elements, and the storage of the table is infeasible. To reduce the storage capacity needed for the likelihood $P(D \mid C)$, it can be factorized into mutually independent factors.

One naive approach is to assume that all tests are independent given the behavioral modes. This assumption is generally not true. Examples of situations where tests are dependent are when several faults have the same root cause, when one test can cause another test to react, when there are errors in the underlying models for the diagnostic tests, and when the probability that a test reacts depends on the working point or the environment. Measurements of outputs from the tests in engine diagnosis have shown that some tests certainly are dependent, while others are independent. Thus, if all tests are assumed to be independent, the posterior probabilities for the system behavioral mode will be incorrect.

Instead, partition the tests into $M$ subsets, such that the tests in different subsets are mutually independent, or can be assumed to be mutually independent, but the tests in the same subset can be dependent. Let the maximum number of tests in a subset be $L$. Let $I_i$ be an index vector, containing the indices of the tests that are in subset $i$, and let $D[I_i]$ be the tests with indices in $I_i$. Then the partition of the tests $D$ into $M$ subsets gives the factorization

\[ P(D \mid C) = P(D[I_1] \mid C)\,P(D[I_2] \mid C) \cdots P(D[I_M] \mid C) \tag{5} \]

of the likelihood. Each of the distributions $P(D[I_i] \mid C)$ can be represented by a table, whose maximum size is determined by $L$. To avoid too large tables, and decrease the storage capacity needed, a limit on $L$ is used. For the case where the tests are binary, the maximum number of elements in each table is $2^{N_C+L}$. Although one table for each factor is needed, the total storage capacity required is reduced.
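The storage argument above can be checked with a few lines of Python. This is a sketch under the same assumptions as the text (binary tests, two behavioral modes per component, and the worst case in which every factor table is conditioned on all components); the function names are chosen for this example only.

```python
def full_table_size(n_components, n_tests):
    """Number of elements in the unfactorized likelihood table P(D | C)."""
    return 2 ** (n_components + n_tests)


def factorized_table_size(n_components, subset_sizes):
    """Worst-case total number of elements for the factorization (5).

    subset_sizes lists the number of tests in each subset D[I_i]; each
    factor table has at most 2^(N_C + size) elements.
    """
    return sum(2 ** (n_components + size) for size in subset_sizes)


# Ten components and ten tests, as in the text.
full = full_table_size(10, 10)
# The same tests split into five subsets of two tests each (L = 2).
factorized = factorized_table_size(10, [2, 2, 2, 2, 2])
```

The full table grows exponentially in the total number of tests, while the factorized representation grows exponentially only in the subset size $L$ and linearly in the number of subsets.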

In (5) the subsets $D[I_i]$ are of different and unknown size. Also, the number of factors, $M$, in the factorization is unknown. Besides the factorization of the likelihood, the parameters of the distributions $P(D[I_i] \mid C)$ must be estimated. For the estimation, assume that we have a set of training data $\mathcal{D} = [\mathcal{D}_C \; \mathcal{D}_D]$, where $\mathcal{D}_C$ are the behavioral modes and $\mathcal{D}_D$ the corresponding test results. The structure problem can now be stated as estimating the factorization (5) and the parameters of the underlying structure, given the training data $\mathcal{D}$. The structure problem can also be thought of in terms of a model selection problem. Here, we assign a model of the class of two-layer Bayesian networks, and use training data to estimate the best structure.

4 The Structure Problem

The structure problem can be visualized graphically as going from the left graph to the right graph in Figure 2. The left graph represents our prior knowledge about the relations between tests and components (solid lines), and the unknown relations between tests (dashed lines). The right graph represents the estimated structure, where tests that are dependent are grouped into the same node. In Figure 2 the left graph represents the structure given by (1), where an X at position $(i, j)$ in (1) gives a solid line between test $i$ and component $j$. The right graph is an example where tests one and three are grouped.

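[Figure 2: two graphs over the components $c_1$, $c_2$, $c_3$ and the tests $d_1$, $d_2$, $d_3$. In the left graph the relations between the tests are unknown (marked "?"); in the right graph the tests $d_1$ and $d_3$ are grouped into one node.]

Figure 2: The structure problem can be represented as going from the left graph to the right.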

To estimate the structure from data, a measure of how well a structure fits the training data is needed. In [Wolf, 1995] the $\chi^2$-test is compared with a Bayesian approach. The disadvantage of the $\chi^2$-test is that it is accurate only for large data sets. With the Bayesian approach, the probability that a certain structure is the underlying structure, given the training data, is computed. This is valid also for small sets of data, see [Jaynes, 2001] and [Wolf, 1995]. The Bayesian approach suits this problem, since the training data contains few examples of system behavioral modes that are unlikely to occur. In the Bayesian approach prior probabilities for the different structures must be given. In this work an uninformative prior, ranking all structures as equally likely, will be used [Wolf, 1995].

To keep the notation simple, the method will be illustrated with a simple example, but it is straightforward to generalize all reasoning to larger problems. In the example, there are three components, represented by the variables $c_1$, $c_2$ and $c_3$, three tests, and the maximum factor size $L$ is set to 2. The relations between tests and components are given by the isolation structure (1). First, the structure is estimated, and then, given the structure, the parameters in the distribution are estimated.

4.1 Structure estimation

We search for a factorization (5), or in other words the index sets $I_i$, $i = 1, \ldots, M$, that suits all different assignments of system behavioral modes $C$. To achieve this, we assume that the index sets in (5) are the same as in

\[ P(D) = P(D[I_1])\,P(D[I_2]) \cdots P(D[I_M]). \tag{6} \]

Note that it is only the structure, i.e. the index sets $I_i$, $i = 1, \ldots, M$, and the number of elements $M$, that are assumed to be the same in (6) and (5), and not the probabilities themselves.

This assumption is reasonable, since if two subsets $D[I_j]$ and $D[I_k]$ are independent, the knowledge of $C$ will not make them dependent. On the other hand, if $D[I_j]$ and $D[I_k]$ are dependent, the knowledge of $C$ can make them independent. As will be shown in Section 7.1, this will not affect the isolation performance, but only increase the storage capacity needed to perform the on-board isolation.

To bias the factorization such that it is better suited for more important faults, more training data from those behavioral modes can be used. Here, all faults are assumed to be equally important and an equal amount of training data is used from each system behavioral mode.

Now, we introduce some notation. The distributions can be represented by multidimensional arrays, with one dimension for each test. Let $p = P(D)$, and for the marginal distributions $p^{12} = P(d_1, d_2) = \sum_{d_3} P(d_1, d_2, d_3)$, where we sum over all possible values of $d_3$, etc. Let $l_r$ be the number of elements in $p^r$, $r = 1, \ldots, 3$. For the elements in the distributions, let $p_{ijk} = P(d_1 = i, d_2 = j, d_3 = k)$, $p^{12}_{ij} = P(d_1 = i, d_2 = j)$ and so on. Correspondingly for the data, let $n_{ijk}$ be the number of observations with $d_1 = i$, $d_2 = j$, $d_3 = k$, and $n^{12}_{ij} = \sum_{k=1}^{l_3} n_{ijk}$ etc. The total amount of data is $N = \sum_{i,j,k=1}^{l_1,l_2,l_3} n_{ijk}$.

In our example $p$ can be factorized in four different ways: such that all tests are independent, or such that two tests are dependent while the third is independent of the other two. In the example, use $H_0$ to denote the hypothesis "all three variables are independent" and $H_q$, $q = 1, \ldots, 3$, to denote "variable $q$ is independent and the other two are dependent". Given a hypothesis $H_q$, $q = 0, \ldots, 3$, $p$ can be factorized in a certain way. For example, $H_1$ means that $p = p^1 p^{23}$. We search for the probabilities of the different factorizations (6) given the training data $\mathcal{D}$. Since we only want one structure, we use the maximum a posteriori (MAP) estimate $H^* = \arg\max_q P(H_q \mid \mathcal{D})$. Bayes' rule gives

\[ P(H_q \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid H_q)\,P(H_q)}{P(\mathcal{D})}, \tag{7} \]

where $P(\mathcal{D})$ is a normalization factor, which can be computed using marginalization,

\[ P(\mathcal{D}) = \sum_{q=0}^{3} P(\mathcal{D} \mid H_q)\,P(H_q). \tag{8} \]

Here $P(H_q)$ is the prior probability for the different factorizations. We apply a prior that is zero for all partitions containing subsets with more than $L$ elements, and constant for all others. The distribution $P(\mathcal{D} \mid H_q)$ can be computed using marginalization over all possible distributions,

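\[ P(\mathcal{D} \mid H_q) = \int P(\mathcal{D} \mid p, H_q)\, f(p \mid H_q)\, dp, \tag{9} \]

where $f(p \mid H_q)$ is the continuous distribution for $p$ and $P(\mathcal{D} \mid p, H_q)$ is a multinomial distribution given by

\[ P(\mathcal{D} \mid p, H_0) = \frac{N!}{\prod_{i,j,k=1}^{l_1,l_2,l_3} n_{ijk}!} \prod_{i,j,k=1}^{l_1,l_2,l_3} \left( p^1_i\, p^2_j\, p^3_k \right)^{n_{ijk}} \tag{10a} \]

\[ P(\mathcal{D} \mid p, H_1) = \frac{N!}{\prod_{i,j,k=1}^{l_1,l_2,l_3} n_{ijk}!} \prod_{i,j,k=1}^{l_1,l_2,l_3} \left( p^1_i\, p^{23}_{jk} \right)^{n_{ijk}}, \tag{10b} \]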

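and similarly for $q = 2, 3$. The elements in the distributions must be between 0 and 1, and for each distribution they must sum to one. The first criterion is enforced by the integration limits in (9). The latter criterion means that $f$ is proportional to delta functions as

\[ f(p \mid H_0) = f(p^1 p^2 p^3 \mid H_0) \propto \delta\Big(\sum_{i=1}^{l_1} p^1_i - 1\Big)\, \delta\Big(\sum_{j=1}^{l_2} p^2_j - 1\Big)\, \delta\Big(\sum_{k=1}^{l_3} p^3_k - 1\Big) \tag{11a} \]

\[ f(p \mid H_1) = f(p^1 p^{23} \mid H_1) \propto \delta\Big(\sum_{i=1}^{l_1} p^1_i - 1\Big)\, \delta\Big(\sum_{j,k=1}^{l_2,l_3} p^{23}_{jk} - 1\Big), \tag{11b} \]

and similarly for $H_2$ and $H_3$.

The integral (9) can now be solved using convolution and Laplace transform techniques [Wolf, 1995]. The result is

\[ \int P(\mathcal{D} \mid p, H_0)\, f(p \mid H_0)\, dp = \frac{N!}{\prod n_{ijk}!}\, \Gamma(l_1)\,\Gamma(l_2)\,\Gamma(l_3)\, F_0 \tag{12a} \]

\[ \int P(\mathcal{D} \mid p, H_1)\, f(p \mid H_1)\, dp = \frac{N!}{\prod n_{ijk}!}\, \Gamma(l_1)\,\Gamma(l_2 l_3)\, F_1 \tag{12b} \]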

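where $\Gamma(\cdot)$ is the gamma function and

\[ F_0 = \frac{\prod_{q=1}^{3} \prod_{i=1}^{l_q} \Gamma(n^q_i + 1)}{\Gamma(N + l_1)\,\Gamma(N + l_2)\,\Gamma(N + l_3)} \tag{13a} \]

\[ F_1 = \frac{\prod_{i=1}^{l_1} \Gamma(n^1_i + 1) \prod_{j,k=1}^{l_2,l_3} \Gamma(n^{23}_{jk} + 1)}{\Gamma(N + l_1)\,\Gamma(N + l_2 l_3)}. \tag{13b} \]

The expressions (12) and (13) are similar for $H_2$ and $H_3$.

Now, we can compute $P(H_q \mid \mathcal{D})$ for all $q$. With the MAP estimate $H^* = \arg\max_q P(H_q \mid \mathcal{D})$ for the structure, i.e. the index sets in (6) and hence also in (5), we can estimate the parameters in the factors in (5).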
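The comparison of the four hypotheses can be sketched in Python as follows. The sketch evaluates the logarithm of (12) with (13) for an arbitrary partition, dropping the factor $N!/\prod n_{ijk}!$ because it is shared by all hypotheses and cancels in (7). It assumes equal priors for the admissible hypotheses, so the MAP estimate is the hypothesis with the largest score. The function name, the zero-based test indices, and the training data are assumptions made for this illustration.

```python
from collections import Counter
from math import lgamma


def log_evidence(samples, partition, n_values=2):
    """log P(data | H) up to the term N!/prod(n_ijk!) shared by all H.

    samples   : list of tuples of test results, e.g. (0, 1, 1)
    partition : list of tuples of zero-based test indices, e.g. [(0,), (1, 2)]
    n_values  : number of values each test can take (2 for binary tests)
    """
    n_total = len(samples)
    total = 0.0
    for block in partition:
        n_cells = n_values ** len(block)  # number of elements l of this factor
        counts = Counter(tuple(s[i] for i in block) for s in samples)
        # Gamma(l) / Gamma(N + l), as in (12) and (13).
        total += lgamma(n_cells) - lgamma(n_total + n_cells)
        # Product of Gamma(n + 1); unobserved cells contribute Gamma(1) = 1.
        total += sum(lgamma(n + 1) for n in counts.values())
    return total


# H0: all tests independent; Hq: test q independent, the other two dependent.
hypotheses = {
    "H0": [(0,), (1,), (2,)],
    "H1": [(0,), (1, 2)],
    "H2": [(1,), (0, 2)],
    "H3": [(2,), (0, 1)],
}

# Hypothetical training data: tests 2 and 3 always agree, test 1 is unrelated.
samples = [(a, b, b) for a in (0, 1) for b in (0, 1)] * 10

scores = {name: log_evidence(samples, part) for name, part in hypotheses.items()}
best = max(scores, key=scores.get)
```

Working with `lgamma` rather than the gamma function itself avoids overflow, since $\Gamma(N + l)$ becomes very large already for moderate amounts of training data.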

4.2 Parameter estimation

For the parameter estimation we use the same notation as in Section 4.1, but with subindex $C$ to denote that we condition on the system behavioral mode, i.e. $p_C = P(D \mid C)$. We will use the MAP estimate $p^*_C$, which maximizes $f(p_C \mid \mathcal{D}, H^*)$. Again, apply Bayes' rule,

\[ f(p_C \mid \mathcal{D}, H^*) = \frac{P(\mathcal{D} \mid p_C, H^*)\,P(p_C \mid H^*)}{f(\mathcal{D} \mid H^*)}, \tag{14} \]

where $P(p_C \mid H^*)$ is the prior for $p_C$ given the partition $H^*$. In this work we apply a prior that is uniform for all $p_C$ that suit the partition $H^*$ and the structure (1), and zero for all other $p_C$. The denominator in (14) is a normalization factor, independent of $p_C$, and hence $P(p_C \mid \mathcal{D}, H^*) \propto P(\mathcal{D} \mid p_C, H^*)$ for all $p_C$ suitable to $H^*$. The MAP estimate is $p^*_C = \arg\max_{p_C} P(p_C \mid \mathcal{D}, H^*)$. From (5) and given $H^*$ we know that we can factorize $p_C = \prod_{m=1}^{M} p^m_C$. The distribution $P(\mathcal{D} \mid p_C, H^*)$ is maximized under the constraint that all elements in each factor should sum to one, $\sum_i p^m_{C,i} = 1$, $m = 1, \ldots, M$, using Lagrange multipliers. The result is

\[ p^{m*}_{C,x} = \frac{n^m_x}{N}, \tag{15} \]

for $x = 1, \ldots, l_m$. With $p^{m*}_C = [p^{m*}_{C,1} \ldots p^{m*}_{C,l_m}]$ and $p^*_C = \prod_{m=1}^{M} p^{m*}_C$ we know, together with $H^*$, both the structure and the parameters in (5), and the isolation problem can be solved using probabilistic inference.
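The estimate (15) amounts to relative frequencies within each factor, which can be sketched as below. The sketch assumes that the counts $n^m_x$ and $N$ are taken over the training samples recorded under the system behavioral mode in question; the function names, the zero-based test indices, and the data are hypothetical.

```python
from collections import Counter


def estimate_factor_tables(samples_by_mode, partition):
    """Estimate one table per factor and per system behavioral mode, as in (15).

    samples_by_mode : dict mapping a mode label to a list of test-result tuples
    partition       : list of tuples of zero-based test indices
    """
    tables = {}
    for mode, samples in samples_by_mode.items():
        n_total = len(samples)
        tables[mode] = []
        for block in partition:
            counts = Counter(tuple(s[i] for i in block) for s in samples)
            # Relative frequency n_x / N for every observed outcome x.
            tables[mode].append({x: n / n_total for x, n in counts.items()})
    return tables


def likelihood(tables, mode, partition, result):
    """Evaluate the factorized likelihood (5) for one vector of test results."""
    value = 1.0
    for block, table in zip(partition, tables[mode]):
        value *= table.get(tuple(result[i] for i in block), 0.0)
    return value


# Hypothetical training data for a single mode, with the partition {1}, {2, 3}.
partition = [(0,), (1, 2)]
training = {"010": [(0, 1, 1), (0, 1, 1), (1, 0, 0), (0, 0, 0)]}
tables = estimate_factor_tables(training, partition)
p = likelihood(tables, "010", partition, (0, 1, 1))
```

Outcomes that never occur in the training data receive probability zero in this sketch, which is a direct consequence of (15) and worth keeping in mind when the training set for a mode is small.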

No training data

If there is no training data available, other ways of assigning the structure and the probabilities are needed. Using the principle of indifference [Jaynes, 2001], we assign the probabilities

\[
p_C =
\begin{cases}
0 & \text{if } D \text{ is inconsistent with } C, \\
1 & \text{if } D \text{ is surely consistent with } C, \\
\frac{1}{K} & \text{otherwise.}
\end{cases}
\tag{16}
\]

Here $K$ is the number of values that $D$ can take given $C$. In this case, the structure will not affect the result, and the assumption that all data is independent can be used. For example, using the isolation structure (1) and assuming binary tests, this gives $P(d_1 = 1 \mid C = [NF, NF, F]) = \frac{1}{2}$ and $P(d_1 = 1 \mid C = [NF, NF, NF]) = 0$. This is basically the same approach as used in [de Kleer and Williams, 1987] and [de Kleer and Williams, 1992].
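For binary tests and the isolation structure (1), the assignment (16) can be sketched per test as follows. The encoding of the structure as a Boolean matrix and the function name are assumptions of this example.

```python
# Isolation structure (1): rows are tests d1..d3, columns are components
# c1..c3, and True stands for an X (the test can react to the fault).
STRUCTURE = [
    [True, False, True],
    [True, True, False],
    [False, True, True],
]


def p_test_fires(test, mode, structure=STRUCTURE):
    """P(d_test = 1 | C = mode) for a binary test, following (16).

    test : zero-based test index
    mode : tuple of "NF"/"F", one entry per component
    """
    can_react = any(
        related and state == "F" for related, state in zip(structure[test], mode)
    )
    # If no related component is faulty, d = 1 is inconsistent with C, so the
    # probability is 0. Otherwise both outcomes are possible (K = 2), so 1/2.
    return 0.5 if can_react else 0.0


p_fault = p_test_fires(0, ("NF", "NF", "F"))
p_no_fault = p_test_fires(0, ("NF", "NF", "NF"))
```

The two calls reproduce the values $\frac{1}{2}$ and $0$ given in the text for test $d_1$.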

5 The Isolation Problem

The on-line isolation is solved by computing the posterior probability for the system behavioral modes, given the test results at a certain time and using the structure $H^*$, the estimated likelihoods $p^*_C$ and the information about the prior $P(C)$. Denoting the prior information with $I$, the posterior probability is

\[ P(C \mid D, H^*, p^*_C, I). \tag{17} \]

To compute (17) efficiently, a Bayesian network can be set up, using the structure and the parameters learned from the structure problem. Standard algorithms for reasoning in Bayesian networks can be used, see [Jensen, 2001] or [Lerner, 2002] for examples. There are also algorithms for computing the $k$ most likely explanations of the data, with even lower complexity [Lerner, 2002].

6 Performance Measure

In the present paper, isolation systems that can be expressed by a two-layer Bayesian network are designed. By choosing different values of $L$, different isolation systems within this class are designed. Further, there are the two extreme cases: assuming that all tests are independent, i.e. $L = 1$, and using no assumptions on independence. Let $\mathcal{I}$ denote an isolation system. Then the output from the Bayesian isolation system, the posterior, is $P(C \mid D, \mathcal{I})$.

In order to compare the performance of two isolation systems, a performance measure is needed. We suggest as an optimal isolation system a system that gives the posterior probability one for the true underlying system behavioral mode. For probabilistic isolation systems, define the expected probability of correctness.

Definition 1 (Expected probability of correctness) Let $D_{C^*}$ be data generated when the system behavioral mode $C^*$ is present, and let $\mathcal{I}$ be a probabilistic isolation system. Then the expected probability of correctness is

\[ \mu(C^*, \mathcal{I}) = E\left[ P(C^* \mid D_{C^*}, \mathcal{I}) \right], \tag{18} \]

where the expectation is over data.

The measure $\mu$ gives the expected probability assigned to the system behavioral mode that is really present. The optimal value of $\mu$ is one. This measure gives one number for each system behavioral mode, which is interesting since the behavioral modes can differ in how difficult they are to isolate.

To summarize the expected probability of correctness into one number, use the average over all system behavioral modes,

\[ \mu(\mathcal{I}) = \frac{1}{m} \sum_{C} \mu(C, \mathcal{I}). \tag{19} \]

Another measure that relates to the isolation system performance is the probability that a correct diagnosis is made, if the system behavioral mode with the largest posterior probability is chosen as the diagnosis. In other words, given data $D_{C^*}$ from the system behavioral mode $C^*$, what is the probability that $P(C^* \mid D)$ is the largest posterior probability? We call this measure the expected probability of correct classification and write it

\[ \mu_{cc}(C^*, \mathcal{I}) = E\left[ P(\hat{C} = C^*) \right], \tag{20} \]

where

\[ \hat{C} = \arg\max_{C} P(C \mid D_{C^*}, \mathcal{I}) \tag{21} \]

and the expectation is over data. The optimal value of $\mu_{cc}(\mathcal{I})$ is one. This measure also gives one number for each behavioral mode. To summarize $\mu_{cc}$, use the average,

\[ \mu_{cc}(\mathcal{I}) = \frac{1}{m} \sum_{C} \mu_{cc}(C, \mathcal{I}). \tag{22} \]
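Both measures can be estimated by sample averages once posteriors have been computed for a number of data sets generated under a known system behavioral mode. The following sketch assumes that each posterior is stored as a dictionary from mode labels to probabilities; the function names and the three example posteriors are hypothetical.

```python
def expected_correctness(posteriors, true_mode):
    """Sample estimate of mu(C*, I) in (18)."""
    return sum(p[true_mode] for p in posteriors) / len(posteriors)


def expected_correct_classification(posteriors, true_mode):
    """Sample estimate of mu_cc(C*, I) in (20) and (21).

    Ties are resolved in favor of the mode that appears first in the dictionary.
    """
    hits = sum(1 for p in posteriors if max(p, key=p.get) == true_mode)
    return hits / len(posteriors)


# Hypothetical posteriors from three data sets generated under C* = [010].
posteriors = [
    {"010": 0.7, "000": 0.3},
    {"010": 0.4, "000": 0.6},
    {"010": 0.9, "000": 0.1},
]
mu = expected_correctness(posteriors, "010")
mu_cc = expected_correct_classification(posteriors, "010")
```

The averages (19) and (22) then follow by repeating the computation for every system behavioral mode and averaging the results.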

Note that choosing the system behavioral mode with the largest probability as the diagnosis is only one of the possible interpretations of the output from the isolation system. There are more refined ways to interpret the results. How to interpret the results from a probabilistic isolation system is further discussed in Section 7.2.

7 Diesel Engine Diagnosis

The Bayesian isolation approach is applied to the diagnosis of the gas flow of a diesel engine with EGR (Exhaust Gas Recirculation) and VGT (Variable Geometry Turbine). A schematic figure of the gas flow is given in Figure 3. In the system there are ten components to be diagnosed, listed in Table 1. In this example, all components considered are sensors, but other kinds of components, such as pipes and actuators, can be diagnosed with this method as well.

[Figure 3: schematic with the blocks Compressor, Inlet manifold, EGR cooler, EGR valve, Exhaust manifold, Turbine, Restriction, and Exhaust system, and the quantities $n_{eng}$, $n_{trb}$, $p_{im}$, $T_{im}$, $p_{em}$, $T_{em}$, $p_{es}$, $T_{es}$, $p_{amb}$, $T_{amb}$, $W_{cmp}$, $W_{trb}$, $W_{egr}$, and $\delta$.]

Figure 3: A schematic figure of the gas flow through the diesel engine with EGR and VGT.

Table 1: The sensors in the engine system

Sensor   Measured quantity
pem      exhaust gas pressure
pim      inlet manifold gas pressure
Tim      inlet manifold temperature
pamb     ambient pressure
Tamb     ambient temperature
uEGR     EGR valve position
uvgt     VGT valve position
wcmp     flow through the compressor
neng     engine speed
ntrb     turbine speed

To make the results easier to overview, only the three components $p_{em}$, $p_{im}$, and $n_{trb}$ are diagnosed in this example, while the other seven are assumed to function correctly. An extension to diagnosis of all sensors is straightforward. All possible combinations of faults of the three components are considered. This gives $m = 8$ possible system behavioral modes. For the system behavioral modes we use a short notation. For example, to denote that components $p_{em}$ and $p_{im}$ are functioning correctly and component $n_{trb}$ is faulty, we let $C = [p_{em}, p_{im}, n_{trb}] = [NF, NF, F]$ be represented by $C = [001]$.

There exists a complex model of the diesel engine process, from which about 60 residual generators can be found [Einarsson and Ahrrenius, 2004]. Due to limitations in the capacity of the on-board control unit, not all 60 residual generators can be executed. In this example, five of the 60 residuals are used. The residuals are thresholded, and the thresholded residuals are used as the diagnostic tests. Here, the tests are binary, but this is not a requirement for the method. The experiments are done on data collected from the engine in a real driving situation.

Four different isolation systems are set up:

T     No assumption of independence
H∗    The most probable structure for some given requirements, $H^*$
N     The naive assumption that the tests are independent, $L = 1$
D     No training data

The diagnosis system T is of course infeasible when considering larger systems, but is used here because it uses all the information given by the training data, and gives in some sense the best possible structure. The system D, designed without training data, is implemented according to (16). This turned out to perform very poorly on the current example, and the result is only given in the summary of the experiments in Table 4.

For the design of the isolation system H∗, the requirement $L = 2$ is used. This reduces the required storage capacity from $2^8 = 256$ for the system T to, in the worst case, $2 \times 2^5 + 2^4 = 80$. In practice, the storage capacity needed will be even smaller, since the tests in each partition will not be related to all components.

7.1 Experimental Results

To design the isolation system H∗, the probabilities for all possible structures with $L = 2$ were computed. The probabilities for the five most probable structures, normalized with the probability of the most probable structure, are given in Table 2. It is clear that the partition [14, 23, 5] is far more probable than the others, and also that there are similarities between the most probable structures. Experiments are run over 10000 Monte Carlo simulations based on data from real driving situations.
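To give a feeling for the size of the search space, the candidate structures for a given $L$ can be enumerated as set partitions whose subsets contain at most $L$ tests. The following Python generator is a sketch of one such enumeration; it is an illustration and not a description of how the search was implemented in the experiments.

```python
from itertools import combinations


def partitions(items, max_size):
    """Yield all set partitions of items whose blocks have at most max_size elements."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    # Choose the companions of the first item, then partition what is left.
    for k in range(max_size):
        for companions in combinations(rest, k):
            remaining = [x for x in rest if x not in companions]
            for tail in partitions(remaining, max_size):
                yield [(first,) + companions] + tail


three_tests = list(partitions([1, 2, 3], 2))
five_tests = list(partitions([1, 2, 3, 4, 5], 2))
```

For three tests and $L = 2$ the generator produces the four hypotheses $H_0, \ldots, H_3$ of Section 4.1, and for five tests it produces 26 candidate partitions, among them `[(1, 4), (2, 3), (5,)]`.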

Table 2: The five most probable partitions normalized with the probability of the most probable partition

Partition, Hi     P(Hi|D)/P(H∗|D)
14, 23, 5         1
15, 23, 4         0.21
1, 23, 4, 5       0.17
14, 23, 5         0.1
1, 23, 45         0.07

The prior probabilities for all three faults are assumed to be equal, $p(c_i) = 0.1$, $i = 1, 2, 3$, and, although not necessary, we assume that they break independently.

To compare the performance of the isolation systems, data sets from different system behavioral modes are applied to the systems, and the probabilities for different diagnoses are computed. In Figures 4, 5, and 6 the probability, and its variance, for three different system behavioral modes and the three isolation systems T, H∗, and N are shown. The true behavioral modes are $C = [010]$, $C = [110]$, and $C = [011]$, respectively. For the first two system behavioral modes, the isolation systems T and H∗ assign the largest probability to the correct system behavioral mode, while the system N does not. For the third system behavioral mode, in Figure 6, all three isolation systems miss the underlying system behavioral mode.

[Figure 4: three bar charts of $P(C \mid C^*, T)$, $P(C \mid C^*, H^*)$, and $P(C \mid C^*, N)$ over the system behavioral modes [000] to [111], each on a vertical axis from 0 to 1.]

Figure 4: The average probability assigned to the different system behavioral modes for the isolation systems T (top), H∗ (middle), and N (bottom). The lines show the variance. The true behavioral mode is $C^* = [010]$.

Table 3: The expected probability of correctness, $\mu$, for three different system behavioral modes.

Isolation system   C = [010]   C = [110]   C = [011]
T                  0.84        0.69        0.11
H∗                 0.59        0.45        0.05
N                  0.30        0.18        0.02
D                  0.002       0.002       0.004

The expected probability of correctness for the behavioral modes $C = [010]$, $C = [110]$, and $C = [011]$ is given in Table 3. All the values of $\mu$ are far from 1, even for the system T, although no assumptions on independence are made in this system. The reason is that some system behavioral modes are difficult to isolate, for example $C = [011]$ shown in Figure 6. A numerical summary of all four isolation systems is given in Table 4. The values of $\mu$ and $\mu_{cc}$ are largest for the isolation system T, followed by the system designed with our method, H∗.

The values of $\mu_{cc}$ in Table 4 indicate that choosing the system behavioral mode with the largest probability as the diagnosis is not always a good way of interpreting the results. It is also interesting to note that the naive isolation system and our designed isolation system need the same amount of storage capacity for the likelihoods. This is not true in general, although $L$ can often be chosen so that the storage capacity needed is significantly reduced compared to the system without restrictions.

The performance is very different for different system be-havioral modes. In general multiple faults are more difficultto detect than single faults. The reason is that the priors for




Table 4: The probability of correct classification and the average probability of correctness for all systems.

Isolation system   µcc    µ      storage needed
T                  0.44   0.40   256
H∗                 0.29   0.25   80
N                  0.18   0.18   80
D                  0.13   0.13   80

multiple faults are very small compared to the priors for a single fault or no fault. One solution to overcome this problem is to consider data from several time steps. In this case the probability $P(C \mid D_t, D_{t+1}, \ldots, D_{t+T}, I)$ for some T > 0 is used instead of the probability $P(C \mid D_t, I)$ as in the case above. This will decrease the influence of the prior, and increase the influence of the likelihood on the posterior. See for example [Jaynes, 2001].
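The effect of using several time steps can be illustrated with a small sketch. It assumes, purely for illustration, that the data D_t are conditionally independent given the behavioral mode, so that the per-step likelihoods multiply; the priors and likelihood values below are invented and are not taken from the experiments.

```python
from math import prod

def posterior(prior, likelihoods):
    """Posterior P(C | D_t, ..., D_{t+T}, I) over behavioral modes.

    prior:       dict mode -> P(C | I)
    likelihoods: list with one dict per time step, mode -> P(D | C, I)
    Assumption (illustrative only): the data of different time steps are
    conditionally independent given the mode C.
    """
    unnormalised = {
        mode: prior[mode] * prod(step[mode] for step in likelihoods)
        for mode in prior
    }
    total = sum(unnormalised.values())
    return {mode: value / total for mode, value in unnormalised.items()}

# Hypothetical numbers: a single fault [010] with a large prior and a
# double fault [110] with a small prior; the data favour the double fault.
prior = {"[010]": 0.99, "[110]": 0.01}
step = {"[010]": 0.2, "[110]": 0.8}

one_step = posterior(prior, [step])        # prior dominates: about 0.04 for [110]
five_steps = posterior(prior, [step] * 5)  # likelihood dominates: about 0.91 for [110]
```

With one time step the small prior keeps the double fault improbable, while five consistent time steps let the likelihood outweigh the prior.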

7.2 Discussion

The experimental results show that the isolation system based on the partitioned structure performs better than the isolation system based on the naive structure, but still performs worse than the structure assuming no independences. One question is, of course, how much performance can be gained by using a larger L. The maximum L that can be used is given by restrictions on the memory capacity of the on-board control unit, but a smaller L could also perform sufficiently well. The accuracy needed depends on how the output from the isolation system is to be evaluated.

One way to interpret the output from a probabilistic isolation system is to use a cost function, and compute the expected cost of measures. The target is to minimize this expected cost.
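A minimal sketch of such a cost-based interpretation is given below. The action names, the cost table and the posterior are hypothetical and serve only to show how the expected cost of each measure is computed and minimized.

```python
def best_action(posterior, cost):
    """Return the measure with the lowest expected cost.

    posterior: dict mode -> P(C | D, I)
    cost:      dict action -> dict mode -> cost of taking the action
               when the true behavioral mode is that mode
    """
    expected = {
        action: sum(posterior[mode] * per_mode[mode] for mode in posterior)
        for action, per_mode in cost.items()
    }
    return min(expected, key=expected.get), expected

# Hypothetical example: "no fault" [000] is the most probable mode, but
# ignoring a fault [010] is expensive.
posterior = {"[000]": 0.6, "[010]": 0.4}
cost = {
    "continue": {"[000]": 0.0, "[010]": 100.0},
    "service": {"[000]": 10.0, "[010]": 10.0},
}
action, expected = best_action(posterior, cost)
# expected["continue"] is 40, expected["service"] is 10, so "service" is chosen
```

In this example the cheapest measure does not correspond to the most probable behavioral mode, which is one reason why picking the mode with the largest probability can be a poor way to read the output.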

So far we have focused on the storage needed to implement the isolation system, and seen that it depends on L. Also, the number of hypotheses $H_q$ defined in Section 4 is interesting, since too many hypotheses can give numerical problems when solving the structure problem. The number of hypotheses increases with increasing L, and with an increasing number of diagnostic tests. To extend this work to large-scale problems, the increased search space for $H^*_q$ must be handled. This extension is a challenge, but beyond the scope of this work.

8 Conclusion

In this paper, Bayesian techniques for fault isolation are presented. The structures of the underlying conditional probabilities are used as a design variable, and they are estimated from training data. The Bayesian method was applied to diagnosis of the gas flow of a diesel engine.

Four different Bayesian isolation systems, with different degrees of dependence assumptions, were compared. The experiments were run on data from real driving situations. The results show that if there is a dependence between tests, this dependence is important to take into account when designing the isolation system. The system designed with the new method performs best among the systems with the same order of complexity.

References

[Colin N. Jones and Lawrence, 2002] Gregory W. Bond, Colin N. Jones, and Peter D. Lawrence. Consistency-based fault isolation for uncertain systems with applications to quantitative dynamic models. In DX 2002, pages 36–42, 2002.

[de Kleer and Williams, 1987] Johan de Kleer and Brian C. Williams. Diagnosing multiple faults. Artif. Intell., 32:97–130, 1987.

[de Kleer and Williams, 1992] Johan de Kleer and Brian C. Williams. Diagnosis with behavioral modes. In Readings in model-based diagnosis, pages 124–130, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.

[de Kleer et al., 1992] Johan de Kleer, Alan K. Mackworth, and Raymond Reiter. Characterizing diagnoses and systems. Artif. Intell., 56(2-3):197–222, 1992.

[Einarsson and Ahrrenius, 2004] Henrik Einarsson and Gustav Ahrrenius. Automatic design of diagnosis systems using consistency based residuals. Master’s thesis, Uppsala University, 2004.

[Gertler, 1998] Janos J. Gertler. Fault Detection and Diagnosis in Engineering Systems. Marcel Dekker, New York, 1998.

[Jaynes, 2001] E. T. Jaynes. Probability Theory - the logic of science. Cambridge University Press, Cambridge, 2001.

[Jensen, 2001] X. Jensen. Bayesian networks. Springer-Verlag, New York, 2001.

[Lerner et al., 2000] Uri Lerner, Ronald Parr, Daphne Koller, and Gautam Biswas. Bayesian fault detection and diagnosis in dynamic systems. In AAAI/IAAI, pages 531–537, 2000.

[Lerner, 2002] Uri Lerner. Hybrid Bayesian Networks For Reasoning About Complex Systems. PhD thesis, Stanford University, October 2002.

[Lu and Przytula, 2005] Tsai-Ching Lu and K. Wojtek Przytula. Methodology and tools for rapid development of large bayesian networks. In DX 2005, pages 89–94, 2005.

[Pulido et al., 2005] B. Pulido, V. Puig, T. Escobet, and J. Quevedo. A new fault localization algorithm that improves the integration between fault detection and localization in dynamic systems. In DX 2005, 2005.

[Reiter, 1992] Raymond Reiter. A theory of diagnosis from first principles. In Readings in model-based diagnosis, pages 29–48, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.

[Schwall and Gerdes, 2002] Matthew Schwall and Christian Gerdes. A probabilistic approach to residual processing for vehicle fault detection. In Proceedings of the 2002 ACC, pages 2552–2557, 2002.

[Touaf and Ploix, 2004] Samir Touaf and Stephane Ploix. Soundly managing uncertain decisions in diagnostic analysis. In DX 2004, 2004.

[Wolf, 1995] David Wolf. Mutual information as a bayesian measure of independence, 1995.


Automatic Generation of Benchmark Diagnosis Models

Gregory Provan
Department of Computer Science,

University College Cork, Cork, [email protected]

Abstract

We describe an algorithm for automatically generating benchmark models that can be used for evaluating diagnosis algorithms. Our algorithm generates models based on a system structure specified by a small-world network, which is a graphical structure that is common to a wide variety of naturally-occurring systems, ranging from biological systems and the WWW to human-designed mechanical systems. To demonstrate this approach, we randomly generate a suite of digital circuit models with small-world network structure, and empirically show the computational complexity of diagnosing these models.

1 Diagnostic Inference for Complex Systems

The problem of model-based diagnosis (MBD) consists of determining whether an assignment of failure status to a set of mode-variables is consistent with a system description and an observation (e.g., of sensor values). This problem is known to be NP-complete. However, this is a worst-case result, and some NP-complete problems are known to be tractable for particular problem classes. For example, graph colouring, which is NP-complete, has empirically been shown to have run-times that depend on the graph structure [Cheeseman et al., 1991].

We are interested in the average-case complexity of MBD algorithms on problem instances with real-world structure. At present, it is not known whether MBD is computationally difficult for the “average” real-world system. There has been no systematic study of the complexity of diagnosing real-world problems, and few good benchmarks exist to test this.

We describe an algorithm for automatically generating diagnostic benchmark models that can be used to analyse the performance of diagnostic inference algorithms. This model generator can be applied to any domain, and can generate models that accurately capture the properties of complex systems, given as input a library of domain-dependent component models.

To demonstrate our approach, we generate a suite of combinatorial circuit models, each of which possesses typical real-world properties, and empirically study the complexity of diagnostic inference within a model-based framework. Our experimental results show that problems with real-world structural properties are computationally more difficult than problems with regular or random structure, such as would be generated by a typical random-problem generator.

This article makes two main contributions. First, it describes a technique to generate diagnosis models with real-world structural properties. This approach circumvents the difficulty of assembling a large suite of test problems (benchmark models), given that most large diagnosis models tend to be proprietary. It also enables us to control model parameters (and hence analyse specific parameters). Second, we show empirically that diagnosing models with real-world structure is computationally hard. This provides the first clear experimental demonstration of this computational intractability.

We organize the remainder of the document as follows. Section 2 examines the topological structure that all real-world complex systems possess. Section 3 summarises the model-based diagnosis task that we solve. Section 4 reviews related work in the area of automated model generation. Section 5 describes the process we adopt for generating diagnostic models. Section 6 presents the experimental results. Finally, Section 7 summarises our contributions and discusses the wider implications of our results.

2 The Structure of Real-World Problems

Several recent theoretical studies and extensive data analyses have shown that a variety of complex systems, including biological [Newman, 2003], social [Newman, 2003], and technological [Braha and Bar-Yam, 2004; i Cancho et al., 2001] systems, share a common underlying structure, which is characterised by a small-world graph. A small-world graph (SWG) is a complex network in which (a) the nodes form several loosely connected clusters, and (b) every node can be reached from every other by a small number of hops or steps.

We can measure whether a network is a small world or not according to two graph parameters: clustering coefficient and characteristic (mean-shortest) path length [Newman, 2003]. The clustering coefficient, C, is a measure of how clustered, or locally structured, a graph is; this coefficient is an average of how interconnected each node’s neighbours are. The characteristic path length, L, is the average distance between any two nodes in the network, or more precisely, the average length of the shortest path connecting each pair of nodes.

In the following, we will summarise the graph-theoretic notation that we adopt to study inference complexity of MBD, and examine the data demonstrating the small-world properties of technological systems.

2.1 Small-World Graph Parameters

This section introduces our notation. We assume that we have a graph G(V, E) with V the set of vertices and E the set of edges. We say that $V_1$ is a parent of $V_2$ in G, denoted $V_1 = \pi(V_2)$, if E contains the arc $(V_1, V_2)$.

Definition 1 (Vertex Degree). The in-degree of a vertex in a digraph is the number of arcs coming to the vertex, and the out-degree is the number of arcs going out of the vertex. The degree k is the total number of incoming and outgoing arcs.

Definition 2 (Path). A path from a vertex $x_0$ to a vertex $x_n$ in a digraph G = (V, E) is a sequence of vertices $x_0, x_1, \ldots, x_n$ that satisfies the following: for each i, 0 ≤ i ≤ n − 1, $(x_i, x_{i+1}) \in E$ or $(x_{i+1}, x_i) \in E$, that is, between any pair of consecutive vertices there is an arc connecting them. $x_0$ is the initial vertex and $x_n$ is the terminal vertex of the path.

SWG Characteristic 1: Mean Shortest Path Length

The Characteristic Path Length L of a SWG is only a meaningful measure if a graph is connected, i.e., if there is a sequence of edges joining any two nodes. We adopt the convention that L is infinite if a graph is not connected. In general, to make comparisons more feasible, all graphs we deal with will be connected.

Definition 3 (Connected graph). A graph is said to be connected if there is a path between every pair of its vertices.

We define the distance between two vertices in a graph as follows.

Definition 4 (Graph Distance). Given a graph G(V, E), the distance between two vertices is the number of edges in a shortest path connecting the two vertices.

As an example, for a random graph $G_{n,p}$, defined on n nodes where any pair of nodes is connected with probability p, the mean distance is $L_{rand} \sim \frac{\ln n}{\ln k}$, where k is the mean node degree.
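The characteristic path length can be computed with one breadth-first search per node. The following sketch treats the graph as undirected, in line with Definition 2, and returns infinity for a disconnected graph, following the convention above; the adjacency-dictionary representation and the function names are assumptions made for this illustration.

```python
from collections import deque
from math import inf, log

def characteristic_path_length(adj):
    """Mean shortest-path length L of an undirected graph.

    adj: dict node -> set of neighbouring nodes.
    Returns inf if the graph is not connected.
    """
    nodes = list(adj)
    if len(nodes) < 2:
        return 0.0
    total = 0
    for source in nodes:
        # Breadth-first search gives the distance from source to every node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(nodes):
            return inf
        total += sum(dist.values())
    return total / (len(nodes) * (len(nodes) - 1))

def random_graph_path_length(n, k):
    """Random-graph estimate L_rand ~ ln n / ln k (k is the mean degree)."""
    return log(n) / log(k)

# A ring of six nodes, each joined to its two nearest neighbours.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
L_ring = characteristic_path_length(ring)  # 1.8
```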

SWG Characteristic 2: Clustering Coefficient

The notion of clustering characterises the degree of cliquishness of a typical neighbourhood in a graph.

We define the neighbourhood $N_i$ for a vertex $v_i$ as its immediately connected neighbours as follows: $N_i = \{v_j : e_{ij} \in E\}$.

The degree $k_i$ of a vertex $v_i$ is the number of vertices in its neighbourhood, i.e., $k_i = |N_i|$. The clustering coefficient $C_i$ for a vertex $v_i$ is the number of links between the vertices within its neighbourhood divided by the number of links that could possibly exist between them. For a directed graph, $e_{ij}$ is distinct from $e_{ji}$, and therefore for each neighbourhood $N_i$ there are $k_i(k_i - 1)$ links that could exist among the vertices within the neighbourhood. Thus, the clustering coefficient is given as:

$$C_i = \frac{|\{e_{jk}\} \cup \{e_{kj}\}|}{k_i(k_i - 1)} : v_j, v_k \in N_i,\ e_{jk} \in E \text{ or } e_{kj} \in E. \qquad (1)$$

This measure is 1 if every neighbour connected to $v_i$ is also connected to every other vertex within the neighbourhood, and 0 if no vertex that is connected to $v_i$ connects to any other vertex that is connected to $v_i$. The clustering coefficient for the whole system is the average of the clustering coefficient for each vertex [Watts and Strogatz, 1998]: $C = \frac{1}{n} \sum_{i=1}^{n} C_i$.

For a random graph $G_{n,p}$, the clustering coefficient is $C_{rand} = \frac{k}{n}$.
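As an illustration of Equation (1), the sketch below computes the clustering coefficient of each vertex of a directed graph and the system-wide average. Two choices are assumptions of this sketch rather than statements from the text: the neighbourhood of a vertex is taken to contain every vertex joined to it by an arc in either direction, and a vertex with fewer than two neighbours is given coefficient 0.

```python
def clustering_coefficients(nodes, arcs):
    """Per-vertex clustering coefficient C_i of a directed graph.

    nodes: iterable of vertices
    arcs:  iterable of directed arcs (u, v)
    """
    arcs = {(u, v) for u, v in arcs if u != v}
    neighbours = {v: set() for v in nodes}
    for u, v in arcs:
        neighbours[u].add(v)
        neighbours[v].add(u)
    coefficients = {}
    for i, n_i in neighbours.items():
        k = len(n_i)
        if k < 2:
            coefficients[i] = 0.0  # assumption: undefined case set to 0
            continue
        # Count directed arcs whose two endpoints both lie in N_i.
        links = sum(1 for u, v in arcs if u in n_i and v in n_i)
        coefficients[i] = links / (k * (k - 1))
    return coefficients

def mean_clustering(nodes, arcs):
    """System-wide clustering coefficient C, the average of the C_i."""
    coefficients = clustering_coefficients(nodes, arcs)
    return sum(coefficients.values()) / len(coefficients)

nodes = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "b"), ("a", "d")]
c = clustering_coefficients(nodes, arcs)
# c["a"] = 2/6, c["b"] = c["c"] = 1/2, c["d"] = 0
```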

We will use these properties in the following section to describe technological systems that we want to diagnose.

2.2 Technological System Topology

Several recent studies of technological systems have shown that they all possess small-world topology [Braha and Bar-Yam, 2004; i Cancho et al., 2001]. In these studies, each technological system is described in graph-theoretic terms, and the underlying topology of the system graph G is studied. For example, for the electronic circuits studied in [i Cancho et al., 2001], the vertices of G correspond to electronic components (gates, resistors, capacitors, diodes, etc.), and the edges of G correspond to the connections (or wires) between the components. These circuits comprise both analog and ISCAS’89/ITC’89 benchmark circuits, and all display C and L parameters that are typical of SWG topologies.

Figure 1: Graph of analog TV circuit. Note the clustering of nodes, especially the dense central cluster.

Figure 1 [i Cancho et al., 2001] shows the topology graph of an analog TV circuit containing 329 components. This figure shows how this graph has clear clusters (especially the dense central cluster). In addition, there are short paths between any pair of components (nodes) in the network. The first two rows of Table 1 compare the clustering and distance parameters with the corresponding random-graph parameters, showing that $C \gg C_{rand}$ and $L \approx L_{rand}$. The third row of Table 1 shows a large circuit taken from the ISCAS’89/ITC’89 benchmark, which displays small-world topology similar to the two smaller circuits [i Cancho et al., 2001].


Circuit   N        C       Crand     L       Lrand
logic     320      0.053   0.0099    5.06    4.99
analog    329      0.34    0.019     3.17    3.13
ISCAS     24,097   0.03    0.00015   11.05   4.38

Table 1: SWG data for circuits. N is the number of nodes; C and Crand denote the clustering coefficient for the circuit and corresponding random graph model; L and Lrand denote the mean distance for the circuit and corresponding random graph model.

3 Model-Based Diagnosis

We can characterise an MBD problem using the triple 〈COMPS, SD, OBS〉 [Reiter, 1987], where:

• COMPS = {C1, ..., Cm} describes the operating modes of the set of m components into which the system is decomposed.

• SD, or system description, describes the function of the system. This model specifies two types of knowledge, denoted SD = (S, B), where the system structure, S, denotes the connections between the components, and the system behaviour, B, denotes the behaviour of each component.

• OBS, the set of observations, denotes possible sensor measurements, which may be control inputs, outputs or intermediate variable-values.

We adopt a propositional logic framework for our diagnostic models. Component i has an associated mode-variable Ci; Ci can be functioning normally, denoted as [Ci = OK], or can take on a finite set of abnormal behaviours.

MBD inference assumes initially that all components are functioning normally: [Ci = OK], i = 1, ..., m. Diagnosis is necessary when SD ∪ OBS ∪ {[Ci = OK] | Ci ∈ COMPS} is proved to be inconsistent. Hypothesizing that component i is faulty means switching from [Ci = OK] to [Ci ≠ OK]. A (minimal) diagnosis is thus a (minimal) subset C′ ⊆ COMPS such that SD ∪ OBS ∪ {[Ci = OK] | Ci ∈ COMPS \ C′} ∪ {[Ci ≠ OK] | Ci ∈ C′} is consistent.

In this article, we adopt a multi-valued propositional logic using standard connectives (¬, ∨, ∧, ⇒). We denote variable A taking on value α using [A = α]. An example equation is [A = t] ∨ [B = f] ⇒ [C = t].
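The definition of a minimal diagnosis can be made concrete with a brute-force sketch on a tiny Boolean circuit. Everything specific here is an assumption for illustration: the three-gate circuit, the observation, and the weak fault model in which the output of a faulty component is unconstrained. The article itself allows a finite set of abnormal behaviours per component, and exhaustive enumeration is only feasible for very small systems.

```python
from itertools import combinations, product

# Hypothetical circuit: name -> (behaviour when OK, input signals, output signal).
COMPONENTS = {
    "A": (lambda x: not x, ["i1"], "oa"),
    "B": (lambda x, y: x and y, ["oa", "i2"], "ob"),
    "C": (lambda x, y: x or y, ["oa", "i2"], "oc"),
}
ORDER = ["A", "B", "C"]  # topological order of the components

def consistent(faulty, inputs, observed):
    """True if assuming exactly `faulty` abnormal is consistent with OBS."""
    faulty = sorted(faulty)
    # Weak fault model: try every possible output of each faulty component.
    for guess in product([False, True], repeat=len(faulty)):
        free = dict(zip(faulty, guess))
        values = dict(inputs)
        for name in ORDER:
            fn, ins, out = COMPONENTS[name]
            if name in free:
                values[out] = free[name]
            else:
                values[out] = bool(fn(*(values[s] for s in ins)))
        if all(values[s] == v for s, v in observed.items()):
            return True
    return False

def minimal_diagnoses(inputs, observed):
    """Enumerate subsets by increasing size, skipping supersets of diagnoses."""
    found = []
    for size in range(len(ORDER) + 1):
        for candidate in combinations(ORDER, size):
            candidate = set(candidate)
            if any(d <= candidate for d in found):
                continue
            if consistent(candidate, inputs, observed):
                found.append(candidate)
    return found

inputs = {"i1": False, "i2": True}
observed = {"ob": False, "oc": True}  # a healthy circuit would give ob = True
diagnoses = minimal_diagnoses(inputs, observed)  # [{"A"}, {"B"}]
```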

4 Automatic Benchmark Generation: Related Work

This article addresses the automated generation of benchmarks for model-based diagnosis. The literature does not contain any work, to our knowledge, that addresses this task for applications other than circuit diagnosis.

The most closely related work in the literature is the work on diagnostic model generation for circuits [Vogels et al., 2004]. This work addresses the detailed simulation of circuit defects (such as metal spot defects or defects in circuit geometry), which itself is a big task. This methodology is important in that very few other researchers have addressed the need to have libraries of components with detailed physics-based failure-mode definitions. This approach, however, has focused on very small circuits, such as a 4-bit ALU, and does not use algorithms for generating arbitrary circuit topologies. Further, the defect simulation cannot be generalised beyond circuits.

A second group of related work addresses automatic benchmark circuit generation for improving the design of programmable logic architectures [Hutton et al., 2002; Christie and Stroobandt, 2000]. Benchmark circuit auto-generation originally was based on applying a circuit generation rule, called Rent’s rule [Landman and Russo, 1971], but has since expanded to include other methods.¹

We now describe the basic methodology of automatic discrete circuit generation, pointing out the similarities and differences to our approach for automating the generation of diagnostic models for circuits and other domains.

Most automatic circuit generation methods are based on one of two methods, which we call equivalence-class and Rent-based methods. The equivalence-class methods [Ghosh and Brglez, 1999] are based on perturbing a seed circuit to generate a circuit with similar overall structure but different local connectivity. The Rent-based methods use a power-law methodology, called Rent’s rule, to generate circuits [Christie and Stroobandt, 2000]. Both methods can generate combinational and sequential circuits, where we define a combinational circuit as one without any distinguished clock inputs (e.g., as provided by D-type Flip-Flop components), and a sequential circuit as a circuit with distinguished clock inputs.² Most auto-generation algorithms first create the combinational circuits, and then use a hierarchical approach to generate the sequential circuits for each level of delay [Hutton et al., 2002].

In the following, we examine the combinational circuit generation process, since this process has some properties that are potentially generalisable to any system model; sequential circuit generation addresses issues that are restricted to a specific class of temporal feedback systems with distinguished clock inputs, features that are not present in many other domains. Moreover, because of its greater generality, we focus in this article on the Rent-based combinational circuit methods.

Rent’s rule [Landman and Russo, 1971] was originally derived empirically, but has since been given mathematical underpinnings. Rent’s rule describes the relationship between the number of external signal connections to a logic block (called the number of “pins”) and the number of logic gates in the logic block.

Rent’s rule is given by:

$$T = t\,n^{\xi},$$

where (a) T is the number of input/output pins,³ (b) n is the number of gates, and (c) the (internal) Rent exponent 0 ≤ ξ ≤ 1 represents the level of placement optimization within a statistically homogeneous circuit, which is characterized by an interconnection topology with an average node degree t (or in engineering terms, t terminals per gate). From an engineering perspective, ξ = 1 corresponds to no placement optimization, i.e., the circuit is interpreted as a random gate arrangement. In actual circuits, the parameter ξ is dependent on circuit topology: microprocessors, gate-arrays, and high-speed computers are characterized by Rent exponents of ξ = 0.45, 0.5, and 0.63, respectively [Christie and Stroobandt, 2000].

¹ See [Chang et al., 2003] for a survey.

² In this article, we focus on atemporal models, which translates to combinational circuits.

³ In graph-theoretic terms, if we represent component i using a node in a topology graph, then the degree ki of component i corresponds to the set of terminals of component i in the circuit-generation domain.
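A short numerical sketch of Rent’s rule follows. The function name and the example values of t and n are assumptions for illustration; only the exponents 0.45, 0.5 and 0.63 are taken from the text.

```python
def rent_pins(t, n, xi):
    """Number of input/output pins T = t * n**xi predicted by Rent's rule.

    t:  average number of terminals per gate
    n:  number of gates in the logic block
    xi: Rent exponent, 0 <= xi <= 1
    """
    if not 0.0 <= xi <= 1.0:
        raise ValueError("the Rent exponent must lie in [0, 1]")
    return t * n ** xi

# Hypothetical block of 100 gates with 4 terminals per gate.
pins_gate_array = rent_pins(4, 100, 0.5)  # 40.0
pins_random = rent_pins(4, 100, 1.0)      # 400.0, no placement optimization

# Exponents quoted above for three circuit classes.
for name, xi in [("microprocessor", 0.45), ("gate-array", 0.5),
                 ("high-speed computer", 0.63)]:
    print(name, round(rent_pins(4, 100, xi), 1))
```

A smaller exponent means fewer external pins for the same number of gates, which reflects a higher degree of placement optimization.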

Several tools have been developed to generate benchmark circuits based on Rent’s rule and other approaches. Examples of such tools are CIRC and GEN.⁴ If one is interested in generating benchmark diagnostic circuits, then these tools can be integrated within the diagnostic model-generation framework described in this article.

Circuit generation algorithms have proven very useful for applications like FPGA design; however, they are restricted to a specific domain, and focus on topology optimisation, rather than on the issues of fault isolation that are relevant to diagnosis benchmarks. As a consequence, we have developed a more general approach to benchmark generation that has some commonality with circuit generation algorithms, but also some key differences.

The key commonality between our approach and these circuit generation algorithms is that we first generate the underlying system topology, using a graph generation algorithm. Specifically, we use a graph generation algorithm that generates a graph with a power-law topology. This approach is a generalisation of the Rent-based topology algorithm, in that Rent’s rule uses a power-law method that is almost identical to the power-law approaches developed within the random-graph community. Both the Rent-based and random-graph methods focus on defining a graphical structure G(V, E) in which the nodes V correspond to components and the edges E correspond to wires between the components.

Key differences between these areas include (1) the extension of the system topology to incorporate functionality, and (2) the tuning of the topology and functionality. With regard to (1), our diagnostic benchmark generator extends the system topology to incorporate a functional description that describes both normal and anomalous system behaviours. With regard to (2), the diagnostic benchmark generator methodology has parameters that can be tuned to generate models to approximate particular domains, but assumes that these parameters are domain-dependent and need to be supplied by domain experts. In the absence of good domain parameters, the generated models will approximate real models with good accuracy, the quality of which can be improved with the use of precise parameters.

Random graph generators can effectively capture the gross topology of complex systems, but much work remains to more precisely capture the detailed structure of particular domains. For example, the actual structure of the WWW is known to differ from the predictions of random graph models [Donato et al., 2004]. In contrast, the practical applications and validity of the circuit-synthesis methods are more heavily researched than the applications and validity of the random-graph generation approach; as a consequence, the models that a circuit-synthesis method generates are provably closer to the real-world targets (circuits) than the models generated by random-graph generators are to their real-world targets, such as the WWW [Donato et al., 2004]. However, many aspects of the circuit-generation algorithms are so particular to the precise architectures of circuits that they are not generalisable to other domains.

⁴ See http://www.eecg.toronto.edu/~jayar/software/software.html.

5 Benchmark Diagnostic Model Generation

This section describes our algorithm for generating benchmark diagnostic models. Figure 2 depicts the process of automatically generating diagnostic models and using them for evaluating diagnosis inference algorithms. Our approach is applicable to any domain, since (a) the underlying topological models have been shown to approximate virtually any complex system [Newman, 2003], and (b) functionality is incorporated into the system model using a component library, where components can be developed for any domain in which the system models are decomposable.

The topology-generation method we adopt was originally developed based on the theory of random graphs; see [Newman, 2003] for background in this area. However, this method focuses solely on the system structure (as captured by the graph), and ignores the system functionality. We extend this approach by adopting the system structure based on the random-graph generators, and then encoding system functionality using a component library.

[Figure 2 diagram. Model Generator: Component Library, Model Topology Generator, Diagnosis Model Generator, Model Suite. Model Evaluator: Inference Algorithm Library, Test-Case Suite, Evaluation of Inference Algorithms.]

Figure 2: The steps of automated model generation and analysis.

Generation Algorithm: We generate diagnostic (benchmark) models in a three-step process.

1. generate the (topology) graph G underlying each model;

2. assign components to each node in G for system Ψ, to create a Model-Based Diagnosis graph (MBD-graph) G′;

3. generate the system description (and fault probabilities) for Ψ.

We now describe this process using an example, and then describe each step of the process.


Example 1. To demonstrate this approach, we study a suite of auto-generated electronic combinational circuits, which are constructed from simple gates. The inputs to the generation process consist of: (a) a component library; (b) parameters defining the system properties, such as the number n of components; and (c) domain-dependent parameters, such as the Rent parameter ξ. As an example, Figure 3 shows several of the gates that we use in our component library, together with truth-tables for the gates (as one method of describing the functionality of each gate). We study networks with n = 50, 60, 70 and 80 components, and generate circuits using a domain-dependent Rent parameter of ξ = 0.5.

Figure 3: Partial component library for combinatorial digital circuit domain. Each gate also has an associated truth-table defining the gate’s functionality.

A key difference between generic circuit models and diagnosis models is that the diagnosis models explicitly encode failure modes and the functional effects of failure modes. As a consequence, the structure of a diagnostic model is slightly different from the structure of the corresponding electronic circuit.

Example 2. Figure 4 shows the schematic of a simple circuit with components A, B, C, D and E. The circuit has two inputs, I1 and I2, and the output of component i is denoted by Oi. Figure 5 shows the process of transforming this schematic into an MBD-graph, which is the basis for constructing a diagnostic model. We first translate the schematic into a topology graph, which makes the graphical topology of the circuit explicit by denoting each component as a node, the inputs as nodes, and the wires linking inputs to components or components to components as directed edges.⁵ Next we replace each component X in the topology graph with a pair (CX, OX), which denotes the mode and output of component X, respectively. We introduce the mode-variables for each component in order to diagnose the fault status of each component. Further, this new structure of the MBD-graph enables us to define (model-based) behavioural equations for each component.

[Figure 4 schematic: inputs I1 and I2; components A, B, C, D and E with outputs OA, OB, OC, OD and OE.]

Figure 4: Schematic of simple electronic circuit.

We formally specify the topology graph G and MBD-graph G′ as follows.

Definition 5 (Topology graph). A topology graph G(V, E) for a system 〈COMPS, SD, OBS〉 is a directed graph G(V, E) corresponding to the system structure S. Hence in G(V, E): (a) the nodes V consist of a collection of nodes corresponding to system components (χ), system inputs (η) and outputs (ζ), i.e., V = χ ∪ η ∪ ζ; and (b) the edges correspond to connections between two component-nodes, between an input-node and a component-node, or between a component-node and an output-node, i.e., E = (χi, χj) ∪ (ηi, χk) ∪ (χl, ζm), for χi, χj, χk, χl ∈ χ, ηi ∈ η, and ζm ∈ ζ.

Definition 6 (MBD-graph). An MBD-graph G′(V′, E′) is a topology graph G(V, E) in which each component node χi ∈ V is replaced with a subgraph consisting of the node for the corresponding component-output Oi, the node corresponding to component-mode Ci, and the directed edge (Ci, Oi).⁶ Hence in G′(V′, E′): (a) the nodes V′ consist of a collection of nodes corresponding to system component-outputs (O), mode-variables (COMPS), system inputs (η) and outputs (ζ), i.e., V′ = O ∪ COMPS ∪ η ∪ ζ; and (b) the edges correspond to connections between two component-output-nodes, between an input-node and a component-output-node, between a component-output-node and an output-node, or between a mode-node and its component-output-node, i.e., E′ = (Oi, Oj) ∪ (ηi, Ok) ∪ (Ol, ζm) ∪ (Ci, Oi), for Oi, Oj, Ok, Ol ∈ O, ηi ∈ η, Ci ∈ COMPS, and ζm ∈ ζ.
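The replacement step of Definition 6 can be sketched as a small graph transformation. The node-naming scheme (`C_X`, `O_X`) and the wiring of the five-component example are assumptions made for this illustration; the exact connections of the circuit in Figure 4 are not reproduced here.

```python
def to_mbd_graph(components, inputs, outputs, edges):
    """Turn a topology graph into an MBD-graph.

    Each component node X is replaced by a mode node C_X and an output
    node O_X joined by the directed edge (C_X, O_X); every edge that
    touched X is redirected to O_X.
    """
    rename = {c: f"O_{c}" for c in components}
    new_nodes = set(inputs) | set(outputs)
    new_edges = set()
    for c in components:
        new_nodes |= {f"C_{c}", f"O_{c}"}
        new_edges.add((f"C_{c}", f"O_{c}"))
    for u, v in edges:
        new_edges.add((rename.get(u, u), rename.get(v, v)))
    return new_nodes, new_edges

# Hypothetical wiring of a circuit with components A to E and inputs I1, I2.
components = ["A", "B", "C", "D", "E"]
inputs = ["I1", "I2"]
edges = [("I1", "A"), ("A", "B"), ("B", "D"),
         ("A", "C"), ("I2", "C"), ("C", "E")]
mbd_nodes, mbd_edges = to_mbd_graph(components, inputs, [], edges)
# 12 nodes (2 inputs, 5 modes, 5 outputs) and 11 edges (6 wires, 5 mode edges)
```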

⁵ This is the graphical framework used for the small-world analyses of electronic systems in [i Cancho et al., 2001].

⁶ Figure 5 shows this replacement process.

[Figure 5 diagram: (a) topology graph with nodes I1, I2, A, B, C, D and E; (b) translation of the component node A into the pair of nodes CA and OA; (c) MBD-graph with nodes I1, I2, CA to CE and OA to OE.]

Figure 5: Transforming the topology graph of a simple electronic circuit into a model-based diagnosis graph.

5.1 Generate Graph Structure for G

We generate a small-world graph (SWG) using the approach of Watts and Strogatz [1998], as this methodology has been shown to generate graphs with mean distance and clustering coefficient that closely match real-world systems [Newman, 2003]. The Watts and Strogatz approach generates a graph G with a pre-specified degree of randomness that is controlled by a probability p ∈ [0, 1]. $p \approx 0$ corresponds to a regular graph, and $p \approx 1$ corresponds to an Erdős-Rényi random graph; graphs with real-world structure (SWGs) occur roughly in the range 0.01 ≤ p ≤ 0.8, as has been determined by empirically comparing the C and L parameters of generated graphs and actual networks [Newman, 2003]. As noted earlier, a node in G corresponds to a system component, and an edge linking nodes i and j in G corresponds to a wire between components i and j. We randomly assign a set O of nodes to be observable inputs and outputs, choosing |O| based on system size, using Rent’s rule. More precisely, we use $|O| = k\,n^{0.5}$, taking Rent parameter ξ = 0.5 and k as the mean node degree.

We can summarise the graph generation process as follows. Figure 6 depicts this process, where we control the proportion of random edges using a rewiring probability p. We start with a regular graph (a ring lattice of n nodes), where each node is connected to its k nearest neighbours. We then introduce random edges, i.e., with probability p we randomly “rewire” an edge by moving one of its ends to a new position chosen at random from the rest of the lattice [Watts and Strogatz, 1998]. We characterise a SWG H that is generated in this way using the representation H(n, k, p).
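The rewiring procedure just described can be sketched as follows. This is a minimal illustration of the Watts and Strogatz construction, not the generator used in the experiments: it produces an undirected graph, assumes an even k with k < n, and avoids self-loops and duplicate edges when choosing the new endpoint. The function name and the `seed` parameter are assumptions of this sketch.

```python
import random

def watts_strogatz(n, k, p, seed=None):
    """Small-world graph H(n, k, p) as a dict node -> set of neighbours."""
    if k % 2 or not 0 < k < n:
        raise ValueError("k must be even and satisfy 0 < k < n")
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    # Ring lattice: join every node to its k nearest neighbours.
    for v in range(n):
        for j in range(1, k // 2 + 1):
            w = (v + j) % n
            adj[v].add(w)
            adj[w].add(v)
    # Rewire each lattice edge (v, v + j) with probability p.
    for j in range(1, k // 2 + 1):
        for v in range(n):
            w = (v + j) % n
            if w in adj[v] and rng.random() < p:
                choices = [u for u in range(n) if u != v and u not in adj[v]]
                if choices:
                    new = rng.choice(choices)
                    adj[v].discard(w)
                    adj[w].discard(v)
                    adj[v].add(new)
                    adj[new].add(v)
    return adj

regular = watts_strogatz(20, 4, 0.0, seed=1)      # p = 0: the ring lattice
small_world = watts_strogatz(20, 4, 0.3, seed=1)  # partly rewired
```

Rewiring moves one end of an edge, so the number of edges, n k / 2, is the same for every value of p.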

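[Figure 6 labels, in order of increasing randomness from p = 0 to p = 1: regular lattice (large L, large C); small-world graph (small L, large C); random graph (small L, small C)]

Figure 6: Generating a small-world graph from a regular ring lattice with rewiring probability p.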
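The generation of H(n, k, p) and the choice of |O| can be sketched as follows. This is a minimal illustration rather than the generator used for the experiments: the function names, the edge-set representation and the rule that a rewired edge may not create a self-loop or a duplicate edge are assumptions of the sketch.

```python
import random


def watts_strogatz(n, k, p, rng):
    """Edge set of H(n, k, p): a ring lattice whose edges are rewired with probability p."""
    if k % 2 or not 0 < k < n - 1:
        raise ValueError("k must be even with 0 < k < n - 1")
    half = k // 2
    # Regular ring lattice: node i is linked to its k/2 nearest neighbours on each side.
    edges = {frozenset((i, (i + d) % n)) for i in range(n) for d in range(1, half + 1)}
    for d in range(1, half + 1):
        for i in range(n):
            if rng.random() >= p:
                continue  # keep this lattice edge
            # Assumed rule: the moved end may not create a self-loop or a duplicate edge.
            free = [j for j in range(n) if j != i and frozenset((i, j)) not in edges]
            if free:
                edges.remove(frozenset((i, (i + d) % n)))
                edges.add(frozenset((i, rng.choice(free))))
    return edges


def num_observables(n, k, xi=0.5):
    """Rent's rule as used in the text: |O| = k * n**xi, capped at n."""
    return min(n, round(k * n ** xi))


def pick_observables(n, k, rng):
    """Randomly choose which nodes are observable inputs and outputs."""
    return set(rng.sample(range(n), num_observables(n, k)))
```

Rewiring moves one end of an existing edge, so the number of edges (n·k/2) is the same for every value of p; only the structure changes.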

5.2 Assign Components to graph G

Given a topology graph G, we associate with each node in G a component, based on the number of incoming arcs for the node. Given a SWG node with i inputs and o outputs, we assign a component, denoted $\Psi_Z(i, o, \tau, B, w)$, where τ denotes the type (e.g., AND-gate, OR-gate), B defines the behavioural equations of component Z, and w the weights assigned to the failure modes of Z.

Example 3. For our experiments, we use a set of digital comparator components, as shown in Figure 3. We have also extended “selector” components, which we characterise as follows: a j-of-k gate will output t if at least j out of the k inputs are t.

Given a node that has q possible components that are suitable, we randomly select a component with probability 1/q. For example, the single-input nodes correspond to single-input gates (NOT, buffer), and the dual-input nodes correspond to dual-input gates (AND, OR, NAND, NOR, XOR).
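The uniform selection and the j-of-k selector gate can be illustrated with a short sketch. The `LIBRARY` table and the function names are hypothetical; only the gate types listed above and the 1/q selection rule come from the text.

```python
import random

# Hypothetical component library, keyed by the number of inputs of a node.
LIBRARY = {
    1: ["NOT", "BUFFER"],
    2: ["AND", "OR", "NAND", "NOR", "XOR"],
}


def assign_component(num_inputs, rng):
    """Pick one of the q suitable component types, each with probability 1/q."""
    return rng.choice(LIBRARY[num_inputs])


def j_of_k(j, inputs):
    """Selector gate: outputs true if at least j of the k inputs are true."""
    return sum(bool(x) for x in inputs) >= j
```

For instance, `j_of_k(2, [True, False, True])` holds because two of the three inputs are true, whereas `j_of_k(3, [True, False, True])` does not.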

5.3 Generate the System Description

Given a selected component, we then generate its normal-mode equations (and potentially failure-mode equations). We randomly select the mode type (of the k possible failure modes) for any component model with probability 1/k. We assign weights to failure-mode values by assuming that normal behaviour is highly likely, i.e., $\Pr(C_i = OK) \approx 0.99$, and faulty behaviour is unlikely, i.e., $\Pr(C_i \neq OK) \approx 0.01$.

[Figure 7 labels: inputs I1, I2; components A to E with outputs OA to OE; fault-mode tags INVERT, SA0, SA1]

Figure 7: Instantiated schematic of simple electronic circuit. Components A, D and E are NOT gates, component C is an AND gate, and component B is a buffer.

Example 4. Figure 7 shows a randomly-generated circuit based on the schematic of Figure 4. Here, we instantiate components A, D and E to NOT gates, component C to an AND gate, and component B to a buffer. This figure also depicts the instantiated failure mode for the components in shaded boxes: components B, C and E have SA1 fault modes, component A has an SA0 fault mode, and component D has an INVERT fault mode. Given this information, we can generate a system description with equations corresponding to the component types and fault-mode types as just described. For example, the equations for gates A and C are as follows:

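$$\begin{aligned}
A:\quad & [I_1 = t] \wedge [M_A = OK] \Rightarrow [O_A = f] \\
& [I_1 = f] \wedge [M_A = OK] \Rightarrow [O_A = t] \\
& [M_A = SA0] \Rightarrow [O_A = f] \\
C:\quad & [I_2 = t] \wedge [O_A = t] \wedge [M_C = OK] \Rightarrow [O_C = t] \\
& \neg\left([I_2 = t] \wedge [O_A = t]\right) \wedge [M_C = OK] \Rightarrow [O_C = f] \\
& [M_C = SA1] \Rightarrow [O_C = t]
\end{aligned}$$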
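The equations above can be read as a small evaluation rule: a stuck-at mode fixes the output regardless of the inputs, and the OK mode applies the nominal gate function. A sketch under that reading follows; the function and table names are hypothetical, and the semantics given to INVERT (negating the nominal output) is an assumption, since the text does not spell it out.

```python
# Nominal (OK-mode) behaviour of the gate types used in Example 4.
NOMINAL = {
    "NOT": lambda a: not a,
    "BUFFER": lambda a: a,
    "AND": lambda a, b: a and b,
}


def component_output(kind, mode, *inputs):
    """Output of a component of the given kind under mode OK, SA0, SA1 or INVERT."""
    if mode == "SA0":
        return False  # stuck-at-0: output is f whatever the inputs
    if mode == "SA1":
        return True   # stuck-at-1: output is t whatever the inputs
    nominal = NOMINAL[kind](*inputs)
    # Assumed semantics: INVERT negates the nominal output.
    return (not nominal) if mode == "INVERT" else nominal
```

With gate A as a NOT gate, `component_output("NOT", "OK", True)` reproduces the first equation (output f), and `component_output("AND", "SA1", False, True)` reproduces the last one (output t).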

6 Experimental Analysis

Given a set of diagnosis models, we studied the inference complexity of each model Ψ by assigning observations (OBS) and then measuring the inference complexity for computing a minimal diagnosis given (Ψ, OBS).

We have adopted the causal network approach [Darwiche, 1998] for our experiments. We generated models containing 50, 60, 70 and 80 components for 103 different values of p ranging from 0 to 1.0, to cover the full range of regular, small-world and random graph structures. We used as our measure of inference complexity the sum of the sizes of all clique tables in the causal network model, which is a typical complexity measure for this type of model.⁷
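As an illustration of this measure, the following sketch sums the table sizes over a given set of cliques, where a table has one entry per joint assignment of the clique's variables. The cliques and domain sizes in the usage note are made up; computing the cliques themselves (i.e., compiling the causal network) is outside the scope of the sketch.

```python
from math import prod


def total_clique_table_size(cliques, domain_size):
    """Sum over cliques of the product of the domain sizes of their variables."""
    return sum(prod(domain_size[v] for v in clique) for clique in cliques)
```

For two cliques {A, B} and {B, C, D} over binary variables, the measure is 2² + 2³ = 12; adding one binary variable to a clique doubles that clique's table.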

Figure 8 shows our results, where each data point is the average of 300 runs. The plot for each model of size n has the same shape; the shape occurs since regular graphs (p near 0) are computationally simple, small-world graphs are computationally hard (with a complexity peak near p ≈ 0.7), and random graphs (p ≈ 1) also are hard, but less hard than the computational peak in the small-world region. The correlation of cluster size and inference complexity presented here is consonant with the analysis of clique sizes (and corresponding complexity of probabilistic inference) using ISCAS'85 benchmark circuits [Fattah and Dechter, 1995].

[Figure 8 plot: x-axis p from 0.0 to 1.0; y-axis Log(Clique-Table size), ticks from 1000 to 100000000; one curve each for n = 50, 60, 70 and 80]

Figure 8: Results of diagnosing SWG-structured digital circuits. Each curve shows a model containing n components.

⁷ This is because the complexity of causal network inference is exponential in the largest clique of the graph (or the graph width) [Darwiche, 1998].

7 Conclusion

This article has described a method for generating diagnostic models that have real-world topology and can be tailored to different domains. We have generated models of systems composed of digital circuits, but can generalise this approach to any domain where systems can be composed from a library of components. This method circumvents the problems with using random graphs for experiments, and provides an alternative to developing suites of hand-built models for benchmarking.

This article has also empirically shown that MBD problems with a topology typical of real-world systems, i.e., with a SWG structure, are computationally hard. In fact, over the space of topologies ranging from regular to random, the SWG structure is the computationally least tractable for MBD, a property shared with other inference tasks, such as graph colouring, timetabling and quasi-group analysis [Walsh, 1999].

We are currently comparing the diagnostic performance of auto-generated models with real-world benchmark models, to further extend the comparison from [Fattah and Dechter, 1995].

References

[Braha and Bar-Yam, 2004] Dan Braha and Yaneer Bar-Yam. Topology of large-scale engineering problem-solving networks. Physical Review E, 69:016113, 2004.

[Chang et al., 2003] Chin-Chih Chang, Jason Cong, and Min Xie. Optimality and scalability study of existing placement algorithms. In ASPDAC: Proceedings of the 2003 conference on Asia South Pacific design automation, pages 621–627, New York, NY, USA, 2003. ACM Press.

[Cheeseman et al., 1991] Peter Cheeseman, Bob Kanefsky, and William M. Taylor. Where the Really Hard Problems Are. In Proc. IJCAI-91, pages 331–337, 1991.

[Christie and Stroobandt, 2000] Philip Christie and Dirk Stroobandt. The interpretation and application of Rent's rule. IEEE Trans. Very Large Scale Integr. Syst., 8(6):639–648, 2000.

[Darwiche, 1998] Adnan Darwiche. Model-based diagnosis using structured system descriptions. J. Artificial Intelligence Research, 8:165–222, 1998.

[Donato et al., 2004] Debora Donato, Luigi Laura, Stefano Leonardi, and Stefano Millozzi. Simulating the webgraph: A comparative analysis of models. Computing in Science and Engg., 6(6):84–89, 2004.

[Fattah and Dechter, 1995] Yousri El Fattah and Rina Dechter. Diagnosing tree-decomposable circuits. In IJCAI, pages 1742–1749, 1995.

[Ghosh and Brglez, 1999] D. Ghosh and F. Brglez. Equivalence classes of circuit mutants for experimental design. In Proceedings of the 1999 IEEE International Symposium on Circuits and Systems (ISCAS '99), pages 432–435, New York, NY, USA, 1999. ACM Press.

[Hutton et al., 2002] Michael D. Hutton, Jonathan Rose, and Derek G. Corneil. Automatic generation of synthetic sequential benchmark circuits. IEEE Trans. on CAD of Integrated Circuits and Systems, 21(8):928–940, 2002.

[i Cancho et al., 2001] Ramon Ferrer i Cancho, Christiaan Janssen, and Ricard V. Sole. Topology of technology graphs: Small world patterns in electronic circuits. Physical Review E, 64(4):046119, 2001.

[Landman and Russo, 1971] B. S. Landman and R. L. Russo. On pin versus block relationship for partitions of logic circuits. IEEE Trans. Computers, 20:1469–1479, 1971.

[Newman, 2003] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.

[Reiter, 1987] R. Reiter. A Theory of Diagnosis from First Principles. Artificial Intelligence, 32:57–96, 1987.

[Vogels et al., 2004] T. Vogels, T. Zanon, R. Desineni, R.D. Blanton, J.G. Maly, W. Brown, J.E. Nelson, Y. Fei, X. Huang, P. Gopalakrishnan, M. Mishra, V. Rovner, and S. Tiwary. Benchmarking diagnosis algorithms with a diverse set of IC deformations. In Proceedings International Test Conference (ITC 2004), October 2004.

[Walsh, 1999] Toby Walsh. Search in a small world. In IJCAI, pages 1172–1177, 1999.

[Watts and Strogatz, 1998] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of “small-world” networks. Nature, 393:440–442, 1998.


Robust Fault Detection using Set-membership Estimation and Constraints Satisfaction

Vicenç Puig∗, Carlos Ocampo-Martínez, Sebastián Tornil and Ari Ingimundarson

Automatic Control Department
Universitat Politècnica de Catalunya (UPC)

Rambla de Sant Nebridi, 10, 08222 Terrassa (Spain)

Abstract

In this paper, the robust fault detection problem for nonlinear systems considering both bounded parametric modelling errors and noises is addressed. Fault detection is formulated as a set-membership estimation problem, and a state estimator that describes the set of all the states consistent with modelling uncertainty, measured data and noise bounds is presented; this is the main contribution of the paper. An implementation of the state estimator based on constraint satisfaction is proposed and applied to fault detection. Finally, the proposed approach is applied to detect faults in limnimeters of a portion of the Barcelona sewer network.

1 Introduction

Model-based fault detection methods rely on the concept of analytical redundancy. The simplest analytical redundancy approach consists of comparing measurements of a system output with corresponding analytically computed values. These values are obtained from measurements of other variables and/or from previous measurements of the same variable by means of a model. In the general case, different estimates of the same variable, measured or not, can be compared. The resulting differences are called residuals and they are indicative of faults in the system. Under ideal conditions, residuals are zero in the absence of faults and non-zero when a fault is present. However, modelling errors, disturbances and noise in complex engineering systems are inevitable, and hence there is a need to develop robust fault detection algorithms. The robustness of a fault detection system indicates its ability to distinguish between faults and model-reality differences [Chen and Patton, 1999].

Classical approaches to handling disturbances and modelling errors use the disturbance decoupling principle, trying to obtain a residual that is sensitive to faults but not to these errors. Techniques like unknown input observers, eigenstructure assignment [Chen and Patton, 1999] or structured parity equations [Gertler, 1998], among others, can be found in the literature. On the other hand, process and measurement noises

∗ Fax: +34 93 739 8628. E-mail address of corresponding author: [email protected]

are usually modelled in a stochastic way (the typical assumption is zero-mean white noise) and their effect is considered using statistical decision methods [Basseville and Nikiforov, 1993].

However, such approaches present several drawbacks. First, decoupling from modelling errors (especially for nonlinear models) is difficult because the distribution matrix is normally unknown and time-varying, and must be estimated. Moreover, the number of disturbances/modelling errors that can be decoupled is limited by the degrees of freedom in the residual generation procedure [Gertler, 1998]. As an alternative strategy, disturbances/model errors are assumed to be bounded and their effect is propagated to the residual using, for example, interval methods [Puig et al., 2002]. Second, in many practical situations it is not realistic to assume a statistical distribution law for the noise; it is more natural to assume that only bounds on the noise signals are available. In this case, the so-called set-membership approach [Milanese et al., 1996] can be used in the context of fault detection, as suggested by [Witczak et al., 2002]. In both cases, the advantage of the bounded description of uncertainty is that it does not require restrictive assumptions (such as a small number of unknown disturbances/parameters, or a known statistical distribution law). However, a limitation is that faults producing a residual deviation smaller than the residual uncertainty due to model uncertainty will be missed (missed alarms).

In this paper, the robust fault detection problem for nonlinear systems considering both bounded parametric modelling errors and noises is addressed; this is the main contribution. Fault detection is formulated as a set-membership estimation problem. A state estimator that describes the set of all the states consistent with modelling uncertainty, the measured data and noise bounds will be presented. Several researchers, such as [Shamma, 1997], [Jaulin et al., 2001a], [Kieffer et al., 2002] and [Calafiore, 2001], among others, have addressed this issue. Unfortunately, the set obtained in this way may become extremely complex due to the nonlinear nature of the model [Kieffer et al., 2002]. In [Witczak et al., 2002], a state estimator based on enclosing the set of states using the smallest ellipsoid is proposed following the algorithms in [Maksarov and Norton, 1996]. However, in this approach only additive uncertainty is considered, but not the multiplicative uncertainty introduced by unknown parameters. An interesting contribution to this problem was presented in [Rinner and Weiss, 2004]. There, a method for fault detection of systems with interval uncertainty was presented, based on dividing the parameter space of the model into parts and checking the consistency of each part separately. If the consistency check failed for a part, that part was dropped from consideration as a part of the parameter space. The consistency check was performed by integrating corner points of the parts of the parameter space to obtain bounds on the trajectories one sample at a time. For the consistency check to be exact, monotonicity conditions were necessary on the differential function for the part of the parameter space that was being checked.

Two possible implementations of a state estimator that takes into account parameter uncertainty and bounded process/measurement noise are presented: one based on set computations [Kieffer et al., 2002] and the other based on constraint satisfaction techniques from [Jaulin et al., 2001b]. However, due to the computational complexity of the method based on set computations, only the method based on constraints satisfaction is finally considered. The fault detection approach presented in this paper can be considered an improvement on the approach to robust fault detection proposed by [Rinner and Weiss, 2004], which only considers system trajectories obtained from the vertices of the uncertain parameter interval, assuming that the monotonicity property holds.

The paper is organized as follows: In Section 2, the problem of fault detection for nonlinear time-varying systems using set-membership estimation is presented. In Section 3, the implementation of set-membership state estimation in fault detection using set computations is discussed. Section 4 addresses the implementation using constraints satisfaction. Finally, in Section 5 the proposed approach is applied to detect faults in limnimeters of a portion of the Barcelona sewer network.

2 Set-Membership State Estimation applied to Fault Detection

2.1 The set-membership estimation approach

Let us consider the following discrete-time nonlinear system describing the behavior of the system to be monitored:

$$x_{k+1} = g(x_k, u_k, \theta_k) + w_k \tag{1a}$$

$$y_k = h(x_k, u_k, \theta_k) + v_k \tag{1b}$$

where:

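• $x \in X \subseteq \mathbb{R}^n$ is the vector of system states, $u \in \mathbb{R}^m$ is the vector of system inputs and $y \in \mathbb{R}^p$ is the vector of system outputs.

• $w_k \in \mathbb{R}^n$ and $v_k \in \mathbb{R}^p$ are process and measurement noises, which are considered unknown but bounded, i.e., $v_k \in V$ and $w_k \in W$, where $V$ and $W$ are the interval boxes:

$$V = \{ v \in \mathbb{R}^p \mid \underline{v} \le v \le \overline{v} \} \tag{2}$$

$$W = \{ w \in \mathbb{R}^n \mid \underline{w} \le w \le \overline{w} \} \tag{3}$$

• $g$ and $h$ are the nonlinear state-space and measurement functions.

• $X_0$ describes the set of initial states as

$$X_0 = \{ x \in \mathbb{R}^n \mid \underline{x}_0 \le x \le \overline{x}_0 \} \tag{4}$$

• $\theta_k \in \mathbb{R}^q$ is a vector of uncertain time-varying parameters with their values bounded by a compact set $\theta_k \in \Theta$ of box type:

$$\Theta = \{ \theta \in \mathbb{R}^q \mid \underline{\theta} \le \theta \le \overline{\theta} \} \tag{5}$$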

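Let us denote the following sequences from the first time instant to time instant $k$:

$$\begin{aligned}
\tilde{u}_k &= (u_j)_0^{k-1} = (u_0, u_1, \dots, u_{k-1}) \\
\tilde{y}_k &= (y_j)_0^{k} = (y_0, y_1, \dots, y_k) \\
\tilde{w}_k &= (w_j)_0^{k-1} = (w_0, w_1, \dots, w_{k-1}) \\
\tilde{v}_k &= (v_j)_0^{k} = (v_0, v_1, \dots, v_k) \\
\tilde{\theta}_k &= (\theta_j)_0^{k-1} = (\theta_0, \theta_1, \dots, \theta_{k-1})
\end{aligned}$$

Definition 2.1 (Set-membership estimation) Given a system described by Eq. (1), an initial compact set $X_0$ and the sequences of measured inputs $\tilde{u}_k$ and outputs $\tilde{y}_k$, the set of estimated states at time $k$ using the set-membership approach is expressed by

$$X_k = \left\{ x_k \;\middle|\; \exists\, w, v, \theta, x_0 \text{ such that } \left(x_j = g(x_{j-1}, u_{j-1}, \theta_{j-1}) + w_{j-1}\right)_{j=1}^{k},\ \left(y_j = h(x_j, u_j, \theta_j) + v_j\right)_{j=0}^{k} \right\}$$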

2.2 Fault detection using set-membership estimation

Definition 2.2 (Fault) Given the sequences of measured inputs $\tilde{u}_k$ and outputs $\tilde{y}_k$ of the actual system, a fault is said to have occurred at time $k$ if there does not exist a set of sequences $\tilde{v}_k$, $\tilde{w}_k$ and $\tilde{\theta}_k$ which satisfy the nominal system description given in Eqs. (1) with initial condition $X_0$ and with noise, disturbances and parameters belonging to $V$, $W$ and $\Theta$, respectively.

According to this definition, a fault can be detected using a set-membership estimator when the set of estimated states in Definition 2.1 is the empty set.

3 Fault Detection based on Set-membership State Estimation using Set Computations

3.1 Preliminary definitions

The computation of the set of estimated states can be implemented by relaxing the relations that exist between variables of consecutive time instants. This makes it possible to determine the set at time instant $k$ using the set obtained at time instant $k - 1$. In order to implement an algorithm for this approach, the following definitions should be considered.

Definition 3.1 (Predicted states set) Considering the system given in Eq. (1), the input $u_{k-1}$ and the set of estimated states $X_{k-1}^e$, the set of predicted states at time $k$ is given by

$$X_k^p = \left\{ x_k = g(x_{k-1}, u_{k-1}, \theta_{k-1}) + w_{k-1} \;\middle|\; x_{k-1} \in X_{k-1}^e,\ \theta_{k-1} \in \Theta,\ w_{k-1} \in W \right\}$$

Definition 3.2 (Consistent states set) Considering the system given in Eq. (1), the input $u_k$ and a measured output $y_k$, the set of states at time $k$ consistent with such a measurement is given by

$$X_k^c = \left\{ x_k \;\middle|\; \exists\, \theta_k \in \Theta,\ v_k \in V \text{ such that } y_k = h(x_k, u_k, \theta_k) + v_k \right\}$$

Definition 3.3 (Estimated state set) Given the predicted state set $X_k^p$ and the consistent state set $X_k^c$ for the system in Eq. (1) at time $k$, the set of estimated states is defined by

$$X_k^e = X_k^p \cap X_k^c$$

It can be proved that $X_k \subseteq X_k^e$ [Kieffer et al., 2002], where $X_k$ is the exact set of consistent states as introduced in Definition 2.1 and $X_k^e$ is the approximated set of consistent states computed as in Definition 3.3.

3.2 Algorithm description

Using the sets introduced in the previous definitions, Algorithm 1 computes the estimated states set and applies it to fault detection.

Algorithm 1 Fault Detection using Set Computations
1: $X_0^e \Leftarrow X_0$
2: for $k = 1$ to $N$ do
3:   Compute $X_k^p$
4:   Compute $X_k^c$
5:   Compute $X_k^e = X_k^p \cap X_k^c$
6:   if $X_k^e = \emptyset$ then
7:     Exit (Fault detected)
8:   end if
9: end for
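To make the predict, check and intersect loop concrete, the following sketch runs it on a hypothetical scalar system $x_{k+1} = \theta x_k + u_k + w_k$, $y_k = x_k + v_k$, with every set represented by a closed interval. The system, the numerical values in the usage note and the function names are assumptions chosen for illustration; they are not the models used in this paper.

```python
def add(a, b):
    """Interval sum."""
    return (a[0] + b[0], a[1] + b[1])


def mul(a, b):
    """Interval product: extremes are attained at the corner products."""
    corners = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(corners), max(corners))


def intersect(a, b):
    """Interval intersection, or None if it is empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def detect_fault(x0, inputs, outputs, theta, w, v):
    """Return the first k whose estimated set is empty, or None if no fault is found.

    inputs[k-1] is u_{k-1}, outputs[k] is y_k; x0, theta, w, v are intervals.
    """
    x_est = x0
    for k in range(1, len(outputs)):
        u = (inputs[k - 1], inputs[k - 1])
        x_pred = add(add(mul(theta, x_est), u), w)           # predicted set
        x_cons = (outputs[k] - v[1], outputs[k] - v[0])      # from y = x + v
        x_est = intersect(x_pred, x_cons)                    # estimated set
        if x_est is None:
            return k
    return None
```

With `theta = (0.4, 0.6)`, `w = (-0.01, 0.01)`, `v = (-0.05, 0.05)`, `x0 = (0.0, 0.0)` and unit inputs, the fault-free outputs `[0.0, 1.0, 1.5, 1.75]` are accepted, while replacing the last output by `3.0` empties the estimated set at k = 3.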

In the case of linear systems and considering only additive uncertainty, the estimated states sets generally take the form of (convex) polytopes, for which efficient algorithms exist in the literature [Chisci et al., 1996]. However, in the nonlinear case (or in the linear case with multiplicative uncertainty), an explicit construction of the estimated states set is essentially prevented by the generality of shapes [Kieffer et al., 2002]. Instead, guaranteed outer approximations of these sets, as accurate as possible, have been used in the literature. In the case of nonlinear systems (or systems including multiplicative uncertainty), such outer approximations are based on subpavings [Kieffer et al., 2002], ellipsoids [ElGhaoui and Calafiore, 2001] and zonotopes [Alamo et al., 2005], among others.

According to Algorithm 1, set-membership state estimation involves three bounding operations applied to the predicted states set $X_k^p$, the consistent states set $X_k^c$ and their intersection $X_k^e$.

3.3 Algorithm implementation

The prediction set step requires characterizing the set $X_k^p$. This set can be viewed as the direct image evaluation of $g(x_k, u_k, \theta_k)$. Jaulin [Jaulin et al., 2001b] provides an algorithm named ImageSp that computes an outer approximation of such an image using subpavings (unions of non-overlapping boxes).

The consistent set step requires characterizing the set $X_k^c$. This set can be viewed as the inverse image evaluation of $h(x_k, u_k, \theta_k)$. Again, Jaulin [Jaulin et al., 2001b] provides an algorithm named Sivia that computes an outer approximation of this set using subpavings.

Finally, the intersection set step requires characterizing the set $X_k^e = X_k^p \cap X_k^c$. This can be implemented using the algorithm to intersect subpavings called Intersection [Jaulin et al., 2001b].

Notice that the fact that ImageSp and Sivia produce outer approximations affects the fault detection properties. Using such approximations, the property that an empty intersection implies the existence of a fault still holds (no false alarms), but fault detectability decreases (hidden faults). The proposed algorithms yield good approximations, but a trade-off between computation time and fault detectability appears.
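The inverse-image step and the accuracy/time trade-off can be illustrated with a deliberately simplified, one-dimensional bisection scheme in the spirit of Sivia. It is a sketch only, not the algorithm of [Jaulin et al., 2001b]: the function names, the example map h(x) = x² and its inclusion function are assumptions made for the illustration.

```python
def sq_inclusion(box):
    """Inclusion function for h(x) = x**2 on an interval (hypothetical example map)."""
    lo, hi = box
    if lo >= 0:
        return (lo * lo, hi * hi)
    if hi <= 0:
        return (hi * hi, lo * lo)
    return (0.0, max(lo * lo, hi * hi))


def invert_by_bisection(box, target, inclusion, eps):
    """Outer approximation of {x in box | h(x) in target} as a sorted list of intervals."""
    stack, kept = [box], []
    while stack:
        lo, hi = stack.pop()
        img_lo, img_hi = inclusion((lo, hi))
        if img_hi < target[0] or img_lo > target[1]:
            continue  # image misses the target: box proved inconsistent, discard it
        inside = target[0] <= img_lo and img_hi <= target[1]
        if inside or hi - lo <= eps:
            kept.append((lo, hi))  # proved consistent, or too small to bisect further
        else:
            mid = 0.5 * (lo + hi)
            stack += [(lo, mid), (mid, hi)]
    return sorted(kept)
```

A smaller `eps` tightens the outer approximation at the price of more bisections. An empty result, as obtained for a target interval that the image of the initial box cannot reach, corresponds to the empty-set condition used for fault detection.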

When the dimension of the set to be characterized is high, the computation time explodes, since set computation algorithms such as Sivia and ImageSp use bisection in all directions. Because fault detection applications require real-time operation, the computation time can be a limitation for the set computation implementation. In the constraints satisfaction implementation, on the other hand, contractors are used and bisection is applied only when needed, using constraint projection, which saves a lot of computation time, as will be seen in the next section.

The approach followed by the algorithm presented in [Rinner and Weiss, 2004] is similar to the Sivia algorithm. In the second step of the Sivia algorithm, a box $[x]$ is discarded if the intersection of the image of $[x]$ and the initial box $[Y]$ is empty. This is similar to the step in [Rinner and Weiss, 2004] in which parts of the parameter space are dropped from consideration when proved inconsistent. The consistency test described can be interpreted as an inclusion function for the differential function one sample ahead in time. An interesting feature of the method presented in [Rinner and Weiss, 2004] is that the test is performed one sample at a time while keeping track of the state vector and of the part of the parameter space that has not been proved inconsistent yet.

4 Fault Detection based on Set-Membership State Estimation using Constraints Satisfaction

4.1 CSP background

A Constraints Satisfaction Problem (CSP) on sets can be formulated as a 3-tuple $\mathcal{H} = (\mathcal{V}, \mathcal{D}, \mathcal{C})$ [Jaulin et al., 2001a], where

• $\mathcal{V} = \{v_1, \cdots, v_n\}$ is a finite set of variables,

• $\mathcal{D} = \{D_1, \cdots, D_n\}$ is the set of their domains, represented by closed sets, and

• $\mathcal{C} = \{c_1, \cdots, c_n\}$ is a finite set of constraints relating variables of $\mathcal{V}$.

A point solution of $\mathcal{H}$ is an n-tuple $(v_1, \cdots, v_n) \in \mathcal{D}$ such that all constraints $\mathcal{C}$ are satisfied. The set of all point solutions of $\mathcal{H}$ is denoted by $S(\mathcal{H})$. This set is called the global solution set. The variable $v_i$ is consistent in $\mathcal{H}$ if and only if

$$\forall v_i \in D_i\ \ \exists\, (v_1 \in D_1, \cdots, v_n \in D_n) \mid (v_1, \cdots, v_n) \in S(\mathcal{H})$$

with $i = 1, \dots, n$. The solution of a CSP is said to be globally consistent if and only if every variable is consistent. A variable is locally consistent if and only if it is consistent with respect to all directly connected constraints. Thus, the solution of a CSP is said to be locally consistent if all variables are locally consistent.

4.2 Fault detection and CSP

Definition 2.1 suggests an alternative way of implementing set-membership state estimators. Its corresponding mathematical expression can be viewed as a constraint satisfaction problem [Jaulin et al., 2001a]. In this case, when a fault occurs, the solution to the CSP associated with the set-membership state estimation will be the empty set, $X_k = \emptyset$, detecting the presence of an inconsistency. Algorithm 2 summarises the computation procedure when the CSP approach is used.

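Algorithm 2 Computation of $X_k^e$ using CSP
1: $\mathcal{V} \Leftarrow \{x_1, x_2, \cdots, x_k, w_1, w_2, \cdots, w_{k-1}, v_1, v_2, \cdots, v_{k-1}, \theta_1, \theta_2, \cdots, \theta_{k-1}\}$
2: $\mathcal{D} \Leftarrow \{X_1, X_2, \cdots, X_k, W_1, W_2, \cdots, W_{k-1}, V_1, V_2, \cdots, V_{k-1}, \Theta_1, \Theta_2, \cdots, \Theta_{k-1}\}$
3: $\mathcal{C} \Leftarrow \{x_{k+1} = g(x_k, u_k, \theta_k) + w_k,\ y_k = h(x_k, u_k, \theta_k) + v_k\}$
4: $\mathcal{H}_{X_k^e} = (\mathcal{V}, \mathcal{D}, \mathcal{C})$
5: $X_k^e = \mathrm{solve}(\mathcal{H}_{X_k^e})$
6: if $X_k^e = \emptyset$ then
7:   Exit (Fault detected)
8: end if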

4.3 Practical implementation

It is well known that solving a CSP on sets has a high complexity [Jaulin et al., 2001a]. A first relaxation consists of approximating the variable domains by means of intervals and finding the solution by solving an Interval Constraints Satisfaction Problem (ICSP) [Hyvönen, 1992]. Determining the intervals that most tightly approximate the sets defining the variable domains requires global consistency, which has a high computational cost [Hyvönen, 1992]. A second relaxation consists of solving the ICSP by means of local consistency techniques based on contractors, which yields conservative intervals and, consequently, imprecise solutions.

The principle of algorithms for solving an ICSP using local consistency techniques consists essentially of iterating two main operations until a stable state is reached. These operations are known as domain contraction and propagation. Roughly speaking, if the domain of a variable $v_i$ is locally contracted with respect to a constraint $c_j$, then this domain modification is propagated to all the constraints in which $v_i$ occurs, leading to the contraction of other variable domains, and so on. Thus, the final goal of such a strategy is to contract the domains of the variables as much as possible without losing any solution, by removing inconsistent values through the projection of all constraints. Projecting a constraint with respect to some of its variables consists of computing the smallest interval that contains only consistent values by applying a contraction operator. Being incomplete by nature, these methods have to be combined with enumeration techniques, for example bisection, to separate the solutions when possible. Domain contraction relies on the notion of contraction operators computing over approximate domains in $\mathbb{R}^n$. An algorithm for finding a solution of an ICSP can be found in [Jaulin et al., 2001b].
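Contraction and propagation can be sketched for the simplest useful case, a set of sum constraints z = x + y over interval domains. The code below is an illustrative forward-backward contractor with a fixed-point loop; the function names, the iteration cap and the example are assumptions of the sketch and do not describe how any particular solver is implemented.

```python
def intersect(a, b):
    """Interval intersection; an empty result means the constraints are inconsistent."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if lo > hi:
        raise ValueError("empty domain: constraints are inconsistent")
    return (lo, hi)


def contract_sum(dom, x, y, z):
    """Project the constraint z = x + y onto each variable; report whether a domain shrank."""
    before = (dom[x], dom[y], dom[z])
    dom[z] = intersect(dom[z], (dom[x][0] + dom[y][0], dom[x][1] + dom[y][1]))
    dom[x] = intersect(dom[x], (dom[z][0] - dom[y][1], dom[z][1] - dom[y][0]))
    dom[y] = intersect(dom[y], (dom[z][0] - dom[x][1], dom[z][1] - dom[x][0]))
    return before != (dom[x], dom[y], dom[z])


def propagate(dom, constraints, max_iters=1000):
    """Re-apply all contractors until no domain changes (or the iteration cap is hit)."""
    for _ in range(max_iters):
        changed = False
        for x, y, z in constraints:
            changed = contract_sum(dom, x, y, z) or changed
        if not changed:
            break
    return dom
```

For example, with x, y ∈ [0, 10], s = 15, d = 2, t ∈ [0, 8] and the constraints s = x + y and t = x + d, propagation contracts x to [5, 6], y to [9, 10] and t to [7, 8]. If t is instead restricted to [0, 3], a domain becomes empty, which is the inconsistency that signals a fault in the estimation setting.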

In the present paper, the ICSP is solved using a tool based on interval constraint propagation, known as Interval Peeler. This tool has been designed and developed by the research team of Professor Luc Jaulin [Baguenard, 2005]. The goal of this software is to determine the solution of the ICSP defined in Section 4.1 in the case that the domains are represented by closed real intervals. The solution provides refined interval domains consistent with the set of ICSP constraints.

5 Case Study

The effectiveness of the proposed approach for robust fault detection is tested on the limnimeters of the Barcelona sewer network, where they are used by the control system [Cembrano et al., 2004].

The city of Barcelona has a combined sewer network of approximately 1,500 km. This means that waste water and rainwater go into the same sewers. Another important issue is that Barcelona has a population of 3,000,000 inhabitants in an area of 98 km², so it has a very high population density. Additionally, the yearly rainfall is not very high (600 mm/year), but it includes heavy storms typical of the Mediterranean climate that cause a lot of flooding problems and Combined Sewer Overflows (CSO) to the receiving waters. A remote control system has been in operation since 1994; it includes sensors, regulators, remote stations, communications and a Control Center at CLABSA¹. At present, the regulators of the urban drainage system of Barcelona comprise 11 pumping stations, 3 gates and 3 detention tanks, which are regulated in order to prevent flooding and CSO.

The remote control system is equipped with 56 remote stations and it includes 21 rain gauges and 91 water-level sensors providing real-time information about rainfall and water levels in the sewer network. All this information is centralised at the CLABSA Control Center through a supervisory control and data acquisition (SCADA) system. The regulated elements (pumps, gates and detention tanks) are currently controlled locally, i.e., they are controlled by the remote station according to the measurements of sensors connected only to that station. However, a new project is underway to design and implement a global optimal control system

¹ In Catalan: Clavegueram de Barcelona, SA (waste water management company of Barcelona)


for the Barcelona sewer network with the objective of reducing flooding and CSO. In its first phase, it has been demonstrated off-line for a test catchment covering an important area of the sewage system, including one detention tank, five gates and overflow devices, as well as the main CSO sites. The system is now being implemented on-line at the Control Center, as part of the second phase of the project.

The global control application requires the use of an operational model of the network dynamics in order to compute, ahead of time, optimal control strategies for the network actuators. These strategies are based on the current state of the system (provided by SCADA sensors), the current rain intensity measures and appropriate rainfall predictions. The optimal strategy computation is an optimization procedure taking into account all the physical and operational constraints of the sewer network, producing set-points which achieve minimal flooding and CSO. In its on-line implementation at the Control Center, the global control system receives data from the SCADA, calibrates the model equations, runs the optimization program and finally sends the set-points to the SCADA, which forwards them to the regulators.

However, the global optimal control of the sewer network is vulnerable to faults. Faults in sensors (rain gauges and limnimeters) and actuators (gates and pumps) are usual, especially in heavy rain scenarios. If these faults are not detected and isolated and, if possible, corrected by introducing some mechanism that ensures fault tolerance, the global optimal control has to be stopped and control moved to local mode. Since several faults appear in every rain scenario (especially in sensors), it is highly probable that the control loop will have to be stopped, which makes the success of the global control system very difficult.

5.1 Model description

Limnimeters can be monitored using an on-line rainfall-runoff model of the sewer network. Complex nonlinear rainfall-runoff models are very useful for off-line operations (calibration and simulation) of the sewer network, but for on-line purposes, such as global optimal control, fault detection and fault diagnosis, a simpler model structure must be selected. One possible modelling methodology to derive a real-time rainfall-runoff model of a sewer network is through a simplified graph relating the main sewers and the set of virtual and real reservoirs [Cembrano et al., 2004]. A virtual reservoir is an aggregation of a catchment of the sewer network which approximates the hydraulics of rain, runoff and sewage water retention thereof. The hydraulics of virtual reservoirs can be expressed as (Figure 1):

dV (t)

dt= Qup(t) − Qdown(t) + I(t)S (6)

whereV is the volume of water accumulated in the catch-ment, Qup and Qdown are the inflows and outflows of thecatchment,I is the rain intensity falling in the catchment andS is its surface. Input and output sewer levels are measuredusing limnimeters and they can be related with flows using alinearised Manning relation:

$$Q_{up}(t) = M_{up}\,L_{up}(t) \tag{7}$$

$$Q_{down}(t) = M_{down}\,L_{down}(t) \tag{8}$$


Figure 1: Virtual reservoir model of a catchment.

Assuming in Eq. (6) that

$$Q_{down}(t) = K_v\,V(t), \tag{9}$$

substituting Eqs. (7) and (8) into Eq. (6) and transforming the resulting expression to the corresponding discrete-time form yields:

$$L_{down}(k+1) = a\,L_{down}(k) + b\,L_{up}(k) + c\,I(k) \tag{10}$$

where $a = 1 - K_v \Delta t$, $b = \frac{M_{up} K_v \Delta t}{M_{down}}$ and $c = \frac{S K_v}{M_{down}}$.
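As an illustration, the discrete-time model of Eq. (10) can be sketched in a few lines of Python. The function names and the numerical values in the usage note are hypothetical and are not taken from the paper. The coefficient c is coded exactly as stated in the text; note that a plain forward-Euler step of Eq. (6) would also carry a factor Δt in c, so this sketch assumes that I(k) denotes the rain depth accumulated over one sampling period.

```python
def reservoir_coefficients(K_v, M_up, M_down, S, dt):
    """Coefficients a, b, c of Eq. (10), as given in the text."""
    a = 1.0 - K_v * dt
    b = M_up * K_v * dt / M_down
    c = S * K_v / M_down  # assumption: I(k) is rain accumulated over one period
    return a, b, c


def simulate_level(L0, L_up, rain, a, b, c):
    """Iterate L_down(k+1) = a*L_down(k) + b*L_up(k) + c*I(k)."""
    levels = [L0]
    for l_up, i_k in zip(L_up, rain):
        levels.append(a * levels[-1] + b * l_up + c * i_k)
    return levels
```

For example, `reservoir_coefficients(0.001, 2.0, 4.0, 100.0, 300.0)` gives coefficients close to (0.7, 0.15, 0.025), and a dry period with no upstream inflow then makes the downstream level decay geometrically with ratio a.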

Figure 2: Case study: Portion of Barcelona Sewage Network.

For this paper, let us consider the portion of the sewer network of Barcelona shown in Figure 2. This portion, with a total surface of 22.6 km2, includes several catchments in Barcelona. They are urban catchments located in the central part of the city. The test catchment includes CSO elements and an area where flooding occurs frequently. It also contains one real detention tank, the Escola Industrial Tank (35,000 m3), with a by-pass gate, an inlet gate and an outlet gate. The main objective of this tank is to avoid flooding downstream and to help minimize CSO. Another two gates, Tarragona-Diputació, located elsewhere in the catchment and used to divert storm flows, are also included in the test catchment. Assuming that the positions of gates C1 and C2 are fixed so as to prevent water flow through links Q1 and Q2, and that gate C3 is completely open, the following nonlinear model can be obtained using the conservation of mass in tanks and gates:

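$$
\begin{aligned}
x_1(k+1) &= x_1(k) + T\left[d_1(k) + d_2(k) - \beta_1\sqrt{x_1(k)}\right] \\
x_2(k+1) &= x_2(k) + T\left[\beta_1\sqrt{x_1(k)} - \beta_2\sqrt{x_2(k)}\right] \\
x_3(k+1) &= x_3(k) + T\left[\beta_2\sqrt{x_2(k)} + d_3(k) - \beta_3\sqrt{x_3(k)}\right] \\
y(k) &= \beta_1\sqrt{x_1(k)} + v(k)
\end{aligned}
\tag{11}
$$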

where

• T = 300 s is the sampling period,

• xi is the water volume in the i-th virtual/real tank Ti,

• ui = di + wi corresponds to the rain inflow measured using rain gauges (P19, P16 and P20, according to Figure 2), di being the real inflow and wi the measurement noise such that wi(k) ∈ Wi = [0, 1],

• y(k) is the level in the output sewer of the tank 2, measured using the limnimeter L41, and

• v(k) is the associated noise, bounded by Vi = [−0.35, 0.35].

Model parameters and associated uncertainty have been estimated from real data using the procedure described in [Ploix et al., 1999]. The obtained values are β1 ∈ [0.0634, 0.0694], β2 ∈ [0.0392, 0.0394] and β3 ∈ [0.0584, 0.0644].

Fault detection using constraints satisfaction, presented in Algorithm 2, has been implemented using MATLAB and the Interval Peeler solver, following the principles described in [Jaulin et al., 2001b] (see [Baguenard, 2005]).
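To make the consistency test concrete, the following Python sketch propagates interval bounds through the model of Eqs. (11) and checks them against one measurement. It is only a naive illustration with plain interval arithmetic and a single forward-backward contraction on x1; it is not Algorithm 2 and it does not reproduce the MATLAB and Interval Peeler implementation used by the authors. The parameter and noise bounds are those quoted above, while all function names and the numerical states in the usage note are hypothetical.

```python
import math

T = 300.0  # sampling period [s], from the text
BETA = [(0.0634, 0.0694), (0.0392, 0.0394), (0.0584, 0.0644)]  # from the text
V = (-0.35, 0.35)  # measurement-noise bounds, from the text


def add(a, b):
    return (a[0] + b[0], a[1] + b[1])


def sub(a, b):
    return (a[0] - b[1], a[1] - b[0])


def scale(t, a):
    """Multiply interval a by a non-negative scalar t."""
    return (t * a[0], t * a[1])


def flow(beta, x):
    """Interval of beta*sqrt(x); monotone because beta > 0 and x >= 0."""
    return (beta[0] * math.sqrt(max(x[0], 0.0)),
            beta[1] * math.sqrt(max(x[1], 0.0)))


def intersect(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def predict(x, d):
    """One prediction step of the state equations; x and d are lists of intervals."""
    q = [flow(BETA[i], x[i]) for i in range(3)]
    x1 = add(x[0], scale(T, sub(add(d[0], d[1]), q[0])))
    x2 = add(x[1], scale(T, sub(q[0], q[1])))
    x3 = add(x[2], scale(T, sub(add(q[1], d[2]), q[2])))
    # Volumes cannot be negative.
    return [(max(lo, 0.0), max(hi, 0.0)) for lo, hi in (x1, x2, x3)]


def correct(x, y):
    """Contract x1 with the measurement y; return None when inconsistent (fault)."""
    q_meas = (y - V[1], y - V[0])          # values compatible with y = q + v
    q = intersect(flow(BETA[0], x[0]), q_meas)
    if q is None:
        return None
    q_lo = max(q[0], 0.0)
    # Backward step: x1 = (q / beta1)^2, taken over the whole beta1 interval.
    x1_from_q = ((q_lo / BETA[0][1]) ** 2, (q[1] / BETA[0][0]) ** 2)
    x1 = intersect(x[0], x1_from_q)
    if x1 is None:
        return None
    return [x1, x[1], x[2]]
```

With the hypothetical state bounds `x = [(900, 1100), (400, 500), (100, 200)]`, a measurement `y = 2.0` is consistent with the predicted outflow of tank 1, whereas `y = 5.0` makes `correct` return `None`, which is the situation interpreted as a fault. Because the dependency between repeated variables is ignored, the bounds produced by `predict` are conservative.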

5.2 Fault scenario 04/06/2000

First, a real rain scenario registered on 04/06/2000 is used to test the set-membership state estimator in a faulty situation. In this scenario, sewer network operators reported a fault in limnimeter L41 (Figure 3). Results of state estimation using Algorithm 2 are presented in Figures 4(a), 4(b) and 4(c), which show the bounds of the state estimation interval for each state variable. It can be noticed that the fault is detected at time instant k = 65 because an inconsistency appears. This inconsistency can be related directly to a fault in limnimeter L41 since, according to Eqs. (11), this sensor is used to correct the state x1.

5.3 Fault scenario 31/07/2002

Next, a real rain scenario registered on 31/07/2002 is used as a second test for the set-membership state estimator in a faulty situation. Again, in this scenario sewer network operators reported a fault in limnimeter L41 (Figure 5). The results of state estimation are presented in Figures 6(a), 6(b) and 6(c), which show the bounds of the state estimation interval for each state variable. In this case, the fault is detected at time instant k = 71, again because an inconsistency appears. This inconsistency can be related directly to a fault in limnimeter L41 since, according to Eqs. (11), this sensor is used to correct the state x1.

Figure 3: Limnimeter L41 measurement (level L41 [cm] versus time [samples]).

6 Concluding Remarks

In this paper, robust fault detection using guaranteed state estimation is presented. This estimation is based on interval models that describe parameter uncertainty. Additionally, process and measurement noise is considered to be unknown but bounded. First, the problem of set-membership state estimation is presented and two approaches to its implementation are considered. The first approach is based on set computations while the second one is based on constraints satisfaction. When the dimension of the set of uncertain states increases, the computation time of the first approach can increase quickly, as set computation algorithms rely on bisections in all directions. By using a constraint satisfaction algorithm that combines contractors and interval arithmetic with bisections, the computational burden can be reduced significantly. Then, the application of set-membership state estimation to fault detection is analyzed. Finally, an application example illustrates the performance of this approach when state estimation using interval constraints satisfaction is implemented.

The main contribution of this paper is the consideration of both modelling uncertainty and measurement noise in a unified way using the set-membership state estimation framework. Another important contribution is to show that constraints satisfaction may be a useful tool for fault detection.

Acknowledgments

The authors acknowledge the support received from Barcelona Sewer Company (CLABSA) in the application presented in this work. The authors also wish to acknowledge the support received from the Research Commission of the Generalitat of Catalunya (ref. 2005SGR00537) and from Spanish CICYT (refs. DPI2002-03500 and DPI2005-05415).

Figure 4: State bounds for rain scenario 04/06/2000. Panels (a), (b) and (c) show the bounds (ximin, ximax) of the volume in tank i [m3], for i = 1, 2, 3, versus time [samples].

References

[Alamo et al., 2005] T. Alamo, J.M. Bravo, and E.F. Camacho. Guaranteed state estimation by zonotopes. Automatica, 41(6):1035–1043, 2005.

Figure 5: Limnimeter L41 measurement (level L41 [cm] versus time [samples]).

[Baguenard, 2005] X. Baguenard. Personal homepage. http://www.istia.univ-angers.fr/∼baguenar/, February 2005.

[Basseville and Nikiforov, 1993] M. Basseville and I.V. Nikiforov. Detection of abrupt changes: theory and applications. Prentice Hall, 1993.

[Calafiore, 2001] G. Calafiore. A set-valued non-linear filter for robust localization. Proceedings of European Control Conference, 2001.

[Cembrano et al., 2004] G. Cembrano, J. Quevedo, M. Salamero, V. Puig, J. Figueras, and J. Martí. Optimal control of urban drainage systems: a case study. Control Engineering Practice, 12(1):1–9, 2004.

[Chen and Patton, 1999] J. Chen and R.J. Patton. Robust Model-Based Fault Diagnosis for Dynamic Systems. Kluwer Academic Publishers, 1999.

[Chisci et al., 1996] L. Chisci, A. Garulli, and G. Zappa. Recursive state bounding by parallelotopes. Automatica, 32(7):1049–1055, 1996.

[ElGhaoui and Calafiore, 2001] L. ElGhaoui and G. Calafiore. Robust filtering for discrete-time systems with bounded noise and parametric uncertainty. IEEE Trans. Automatic Control, 46(7):1084–1089, 2001.

[Gertler, 1998] J. Gertler. Fault Detection and Diagnosis in Engineering Systems. Marcel Dekker, New York, 1998.

[Hyvönen, 1992] E. Hyvönen. Constraint reasoning based on interval arithmetic: The tolerance approach. Artificial Intelligence, 58:71–112, 1992.

[Jaulin et al., 2001a] L. Jaulin, M. Kieffer, I. Braems, and E. Walter. Guaranteed nonlinear estimation using constraint propagation on sets. International Journal of Control, 74(18):1772–1782, 2001.

[Jaulin et al., 2001b] L. Jaulin, M. Kieffer, O. Didrit, and E. Walter. Applied Interval Analysis, with Examples in Parameter and State Estimation, Robust Control and Robotics. Springer-Verlag, London, 2001.


Figure 6: State bounds for rain scenario 31/07/2002. Panels (a), (b) and (c) show the bounds (ximin, ximax) of the volume in tank i [m3], for i = 1, 2, 3, versus time [samples].

[Kieffer et al., 2002] M. Kieffer, L. Jaulin, and E. Walter. Guaranteed recursive non-linear state bounding using interval analysis. International Journal of Adaptive Control and Signal Processing, 16(3):193–218, 2002.

[Maksarov and Norton, 1996] D. Maksarov and J. Norton. State bounding with ellipsoidal set description of the uncertainty. International Journal of Control, 65(5):847–866, 1996.

[Milanese et al., 1996] M. Milanese, J. Norton, H. Piet-Lahanier, and E. Walter. Bounding Approaches to System Identification. Plenum Press, 1996.

[Ploix et al., 1999] S. Ploix, O. Adrot, and J. Ragot. Parameter uncertainty computation in static linear models. Proceedings of IEEE Conference on Decision and Control, 2:1916–1921, 1999.

[Puig et al., 2002] V. Puig, J. Quevedo, T. Escobet, and S. De las Heras. Robust fault detection approaches using interval models. Proceedings of IFAC World Congress, 2002.

[Rinner and Weiss, 2004] B. Rinner and U. Weiss. Online monitoring by dynamically refining imprecise models. IEEE Trans. Syst., Man, Cybern., 34:1811–1822, 2004.

[Shamma, 1997] J. Shamma. Approximate set-value observer for nonlinear systems. IEEE Trans. Automatic Control, 42(5):648–658, 1997.

[Witczak et al., 2002] M. Witczak, J. Korbicz, and R. Patton. A bounded-error approach to designing unknown input observers. Proceedings of IFAC World Congress, 2002.


Hierarchical Modelling and Diagnosis for Embedded Systems

Herve Ressencourt1,2 Louise Trave-Massuyes1 Jerome Thomas2

1 LAAS-CNRS, 7, Avenue du Colonel Roche, FRANCE-31077 TOULOUSE Cedex 4, hressenc,[email protected]

2 ACTIA, 25, Chemin de Pouvourville, FRANCE-31432 TOULOUSE Cedex 4, [email protected]

Abstract

Because of the increasing complexity of engineered systems, abstractions and hierarchies in models are receiving great attention. The behaviour of embedded systems is commonly characterised by hybrid phenomena in which each operational mode is activated by electronic units: it hence involves hardware and software components. The aim of this work is to apply a multimodelling approach to such systems for the diagnosis task. This is illustrated by an example taken from the automotive domain.

1 Introduction

The increasing complexity of engineered systems has led the Model-Based Reasoning (MBR) communities to focus their research on reasoning tasks, such as diagnosis, based on models at multiple abstraction levels organised through a hierarchy. Abstractions are useful to reduce the computational complexity of diagnosis reasoning, to account for observations at qualitative levels, and to handle systems whose available knowledge about components is heterogeneous.

Two kinds of hierarchies are commonly used in MBR: structural abstraction [Chittaro and Ranon, 2004] [Mozetic, 1991] [Autio and Reiter, 1998], which aggregates components to describe the system at different levels of detail, and functional abstraction, which abstracts the behaviour according to the functional and teleological understanding of the system [Chittaro et al., 1993] [Kitamura et al., 2002]. The main idea of a functional description is to bridge from behavioural to teleological knowledge (knowledge about goals) by exhibiting the functional roles that the structural components may play in the achievement of the function of the whole system.

The objective of this work is to devise a multimodelling cooperation framework for the diagnosis of complex embedded hybrid systems controlled by electronic units. A brief overview of existing approaches to functional modelling is first given in section 2. In the third section the limits of these approaches are discussed and an extension of Chittaro's framework [Chittaro et al., 1993] is proposed to model hybrid physical systems including hardware and software components. Then, some perspectives are presented for the off-board diagnosis task of automotive systems based on this framework.

2 Knowledge representation for Model Based Reasoning

Knowledge representation is a key issue in MBR. Luca Chittaro and colleagues [Chittaro et al., 1993] write that choices have to be made, especially about ontologies, epistemological types, representational assumptions and aggregation levels. These choices are mainly directed by the goals of the models (design analysis, diagnosis...) and by the requirements of the reasoning task. It is commonly accepted that knowledge about physical systems can be organised through two axes [Lind, 1982]:

• The Whole-Part hierarchy relies on different aggregation levels for the same type of knowledge. An entity of this hierarchy is a part of the one above it. For example, structural abstraction has been used for the diagnosis task [Chittaro and Ranon, 2004] [Mozetic, 1991] [Autio and Reiter, 1998].

• The Means-End or functional hierarchy relies on the teleological understanding of behaviour [Chittaro et al., 1993] [Kitamura et al., 2002].

A functional description hierarchy has to answer three questions: "Why was the system designed?", "What is the system supposed to do to achieve the goal?" and "How must different parts of the system interact in order to realise the functions?" [Modarres and Chehon, 1999].

2.1 Functional abstraction hierarchy

Several works agree on a model hierarchy that distinguishes four epistemological types:

• The Structural knowledge is the knowledge about system topology.

• The Behavioural knowledge describes the physical laws underlying the behaviour of the components composing the system.

• The Functional knowledge describes the roles components may play in the process in which they take part. This level is named Base-Function layer in [Kitamura et al., 2002].

• The Teleological knowledge describes the goals of the system intended by its designer. This level is named Meta-Function layer in [Kitamura et al., 2002].

Figure 1: CPD example for an electrical circuit whose function is to produce light

The functional knowledge level aims at bridging the structural and behavioural knowledge on one side and the teleological knowledge on the other side, which respectively rely on two different ontologies:

• The object-centered ontology, sometimes named component ontology, assumes that the system is made of individual objects whose properties are context independent and stated in a generic way.

• The system-centered ontology, sometimes named process ontology, is a context dependent ontology. It assumes that the system involves a set of physical phenomena which are activated/deactivated according to the current context.

Following these lines, [Chittaro et al., 1993] elaborated a proposal called the multimodelling approach, which brings solutions to many critical problems such as the formalisation of the links between the models, their meaning and the representation language at each level. Notice that the framework of [Chittaro et al., 1993] matches the one by Kitamura [Kitamura et al., 2002]. One feature of this approach is to implement the functional model in three interlinked levels: a model of functional roles, a role being associated with a single component, a model of processes emerging from functional role networks, and a model of phenomena.

2.2 Different approaches for functional modelling

Being at the crossroads of component-based and process-based ontologies, the functional model necessarily relies on a hybrid ontology. Many approaches have been suggested in the literature for representing functional knowledge. They can be classified into two categories: the state based approaches [Chandrasekaran, 1994] [Price and Snooke, 1998], relying on the abstraction of behaviour states, and the flow based approaches, relying on flow models [Lind, 1982] [Chittaro et al., 1993]. Other works have been interested in the use of an ontology for defining the functional concepts [Chittaro et al., 1993] [Kitamura et al., 2002].

The state-based representations

Functions are built from the knowledge of the causal relations existing among the system's states. A system state corresponds to an assignment of values to the variables describing the system (generally assumed to have discrete value domains). Thus, a directed graph can be defined in which the nodes are predicates about the states of the system and the links indicate causal relations. This graph is commonly named Causal Process Description (CPD) [Chandrasekaran, 1994]. Every path in this graph can be interpreted as a function of the system. This framework has been predominantly used for simulation and design analysis tasks [Bell et al., 2005] [Price and Snooke, 1998]. A simple example of a lighting circuit is provided in Figure 1 to illustrate the approach.
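As a minimal sketch of this idea, a CPD can be stored as an adjacency list and its causal paths enumerated by a depth-first traversal. The node labels below are hypothetical stand-ins for a lighting circuit; they are not copied from Figure 1, which is not reproduced here.

```python
def causal_paths(graph, start, path=None):
    """Enumerate the maximal causal paths that begin at `start`."""
    path = (path or []) + [start]
    paths = []
    for successor in graph.get(start, []):
        if successor not in path:  # guard against causal loops
            paths.extend(causal_paths(graph, successor, path))
    # A node with no unvisited successor ends the path.
    return paths or [path]


# Hypothetical CPD: predicates on states, linked by causal relations.
cpd = {
    "switch closed": ["current flows"],
    "current flows": ["filament heated"],
    "filament heated": ["light produced"],
}
```

Here `causal_paths(cpd, "switch closed")` returns the single path leading from the closed switch to the produced light, which is the kind of path read as a function in the state-based view.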

It should be noticed that such functional models may not be reusable since modelling choices are subjective and may vary from user to user. Nevertheless, one advantage of this approach is that discrete event behaviours can be easily described together with continuous phenomena as long as they are represented at a high level of abstraction.

The flow-based representations

They are based on the concepts of generalised variables of flow and effort [Lind, 1982] [Chittaro et al., 1993]. These concepts were first used by Paynter [Paynter, 1961] and the Bond Graph community. In this method, a finite set of functional primitives is defined and the functional description is expressed in terms of these primitives. It should be noticed that the primitives are the same for all flow-based representations. The main advantage of this approach is that a real ontology is defined, on which the functional modelling of any physical system can rely. Functional primitives are linked to structural and behavioural components so that the functional model of a given system can be generated automatically from the behavioural and structural knowledge. One criticism of this approach is that only physical devices are modelled. No ontology is suggested for components which have discrete events and sequential behaviours.

Despite the difference between state-based and flow-based approaches, they share a common principle: the functional model consists of a causal interpretation of behaviour.

3 Extended multimodelling framework

In this section, we propose a multimodelling framework based on an extension of Chittaro's framework to include software components implementing control actions, and hence to deal with hybrid systems. Our approach relies on:

• adding a mode labelling to the behavioural model that links to the corresponding operating mode,

• introducing new software processes and phenomena in the model of processes and the model of phenomena, respectively.

The approach is illustrated throughout the section by a case study from the automotive domain presented in section 3.1.

3.1 Case study taken from the automotive domain

The rear wiping system, taken from the automotive domain, has been chosen (see Figure 2) to illustrate the multimodelling problem for hybrid and controlled systems. There are three ways to activate the rear system on some modern cars: on request of the driver through activation of the steering wheel switching module, on request of the rear washing function, and when the screen wiper is activated and the driver engages the reverse mode.


Figure 2: Synoptic diagram of the rear wiper system. The electrical variables u1, uM and i1 represent respectively the voltage applied to the electric circuit pins, the counter electromotive force and the intensity. The mechanical ones ωM, γM and θM are respectively the angular velocity of the rotor, the torque and the angular displacement of the rotor.

Figure 3: The automaton describing the behaviour of ECU2 (on the left) and its abstraction by a software process named CONTROL. The conditions tr(e) and fs(e) mean a rising edge and a falling edge on the event e, respectively.

The synoptic diagram of this system is given in Figure 2. The rear wiper system is composed of two Electronic Control Units (ECU). The role of the first one (ECU1) is to receive the request "rear wiper on" (RCW = 1) selected by the driver on the steering wheel switching module and to transmit this status to the second one (ECU2) through the data bus. ECU2 has to elaborate and send the control signal which closes the electric relay, taking into account the user control data and other inputs coming from other functions of the vehicle.

This system is a combination of hardware components (electrical circuits, mechanical devices...) and software components (embedded pieces of software in ECUs, data bus). The behaviour of software components is assumed to be described by state machines¹.

The behaviour of the rear wiper system (Figure 3) is characterized by a cycle composed of two activities: one wiping action followed by a waiting state of the wiper at the idle position θM = 0 during a time interval Tint. The idle state is triggered by ECU2 detecting the event associated with θM = 0 sent by the switch K2. Such a behaviour is named "intermittent wiping" in the automotive domain.

¹ State-charts are widely used in the automotive industry for software modelling.

3.2 The structural and the behavioural models

The structural model

The structural model describes the topology of the system by using three primitives: the components and their terminals, nodes (to connect together two or more components) and connections (to describe how components are connected together through nodes).

The behavioural model

The behavioural model describes the internal properties of each component. There are three kinds of behaviour: the continuous behaviours are described by continuous equations (arising from physical laws for instance), the hybrid components are described by hybrid automata and the software components are described by discrete automata². In the case of hybrid components, the continuous behaviour relations are labelled according to the corresponding operating mode. Software components are virtual and defined by the fact that they each implement the control of a hybrid component.

² In this paper, we restrict ourselves to discrete controllers.

Figure 4: The causal model of the rear wiper motor

3.3 The functional model

The functional model has to achieve the bridge between the behavioural and the teleological knowledge. So, its aim is to describe how the behaviours of individual components contribute to the achievement of the function intended by the designer. Note that the same component may contribute to the achievement of more than one function when acting in different operating modes. Functional modelling is performed by using three progressive levels of interpretation: the causal model, the model of processes and the model of phenomena. Notice that our functional model replaces the functional role model of [Chittaro et al., 1993] by a causal model, which allows us to exhibit the processes automatically as proposed by [Thetiot et al., 1998].

The causal model

Causality is one of the essential concepts for reasoning about physical systems. It is widely used in qualitative physics to explain and to predict physical systems' behaviours.

Several operational methods have been proposed for the automatic generation of causal links, also named influences, from the behavioural knowledge. Among the best known methods are causal ordering algorithms [Iwasaki and Simon, 1994] [Trave-Massuyes and Pons, 1997] and the Bond Graph based method [Thetiot et al., 1998]. Considering the rear wiper electric motor example (Figure 2), the behavioural equations when the relay K1 is closed are:

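$$u_1 = U_0 \tag{1}$$

$$u_1 = R_1 \cdot i_1 + u_M \tag{2}$$

$$u_M = k \cdot \omega_M \tag{3}$$

$$\gamma_M = k \cdot i_1 \tag{4}$$

$$\gamma_M = C_r \tag{5}$$

$$\frac{d\theta_M}{dt} = \omega_M \tag{6}$$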

In this behavioural description, two physical views are represented. Equations (1) and (2) correspond to an electrical view and equations (5) and (6) correspond to a mechanical view. The mapping between these views is given by equations (3) and (4). The causal influences between the variables of this device, given in Figure 4, form a causal network in which each arrow between two variables x and y (x −→ y) means that "x influences y" or that "the events occurring in x influence the events occurring in y", without specifying these events.
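To illustrate how influences may be derived from equations (1) to (6), the sketch below applies a very simple causal ordering: it repeatedly picks an equation that has exactly one variable not yet determined and lets the other variables of that equation influence it. This is only a simplified illustration, not the algorithm of [Iwasaki and Simon, 1994] or [Trave-Massuyes and Pons, 1997], and the resulting network need not coincide with Figure 4. It assumes that U0, Cr, R1 and k are exogenous, and uses the ASCII names wM, gM and thetaM for ωM, γM and θM.

```python
# Variables appearing in each behavioural equation (parameters omitted).
EQUATIONS = {
    "(1)": {"u1"},               # u1 = U0
    "(2)": {"u1", "i1", "uM"},   # u1 = R1*i1 + uM
    "(3)": {"uM", "wM"},         # uM = k*wM
    "(4)": {"gM", "i1"},         # gM = k*i1
    "(5)": {"gM"},               # gM = Cr
    "(6)": {"thetaM", "wM"},     # d(thetaM)/dt = wM
}


def causal_ordering(equations):
    """Return (source, target) influences found by single-unknown elimination."""
    known, influences, pending = set(), [], dict(equations)
    progress = True
    while pending and progress:
        progress = False
        for name in sorted(pending):
            unknown = pending[name] - known
            if len(unknown) == 1:
                (target,) = unknown
                for source in sorted(pending[name] - {target}):
                    influences.append((source, target))
                known.add(target)
                del pending[name]
                progress = True
                break  # restart the scan with the enlarged set of known variables
    return influences
```

Under these assumptions, `causal_ordering(EQUATIONS)` yields the chain gM → i1 → uM → wM → thetaM, with u1 also influencing uM.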

When the behaviour of a physical system is hybrid (with different operating modes), a causal model is computed for each mode and labelled accordingly. In the example, the causal model of Figure 4 corresponds to the mode "the electric relay is closed" and this mode is activated by the state of the software implemented in the ECU2.

Hence, as for the behavioural model, the software components are represented at the causal model level by the labelling. One reason for this choice is that software components implement control actions determining the operating mode of components according to the events occurring on their inputs. Their input-output mapping is explicitly represented at the level of processes (cf. section The model of processes) and above.

The model of processes

The model of processes represents the set of processes that may occur in a system and their relationships. Each process belongs either to the physical or to the software domain. Physical processes are mapped to a set of causal influences of the causal model. Software processes are each mapped to the automaton that determines the mode of one physical component. In Figure 3, the software process named CONTROL is an abstraction of the partial automaton that determines the mode of the electric relay through the output Out1 of ECU2. So, a process is represented by a four-tuple <name, cofunction, precondition, effect> [Chittaro et al., 1993]:

• name is the name of the process.

• cofunction is a causal network which specifies which causal influences are necessary to enable the occurrence of a physical process; or an automaton for a software process.

• precondition is a logical predicate which characterises the situation which enables the process to occur.

• effect is a logical predicate which characterises the situation during the occurrence of the process.
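The four-tuple above could be represented, for instance, by a small data structure such as the following. This is a sketch for a physical process only: the cofunction is stored as a set of causal influences, and the fact names (`K1_closed`, `current_flows`) as well as the influence chosen in the example are hypothetical illustrations, not taken from Figure 5.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

State = Dict[str, bool]  # hypothetical: named boolean facts describing a situation


@dataclass
class Process:
    name: str
    cofunction: FrozenSet[Tuple[str, str]]  # causal influences (source, target)
    precondition: Callable[[State], bool]   # situation enabling the process
    effect: Callable[[State], bool]         # situation while the process occurs

    def can_occur(self, state: State) -> bool:
        return self.precondition(state)


# Illustrative electrical process; the influence and predicates are assumptions.
pe1 = Process(
    name="TRANSPORTING",
    cofunction=frozenset({("u1", "i1")}),
    precondition=lambda s: s.get("K1_closed", False),
    effect=lambda s: s.get("current_flows", False),
)
```

A software process would carry an automaton instead of a causal network in its cofunction field; that variant is not covered by this sketch.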

The organisation of processes of the rear wiper system is given in Figure 5. There are three kinds of processes according to the different views:

• Pm1, Pm2 and Pm3 are mechanical processes. Pm1 and Pm3 are named "SWITCHING" because they are related to the mechanical actions on the rear wiper control and on the contact K1, respectively.

• Pe1, Pe2 and Pe3 are electrical processes. They are named "TRANSPORTING" as in [Chittaro et al., 1993].

• Ps1, Ps2 and Ps3 are software processes. Ps1, named "STORAGE", is related to the software component ECU1, whose role is to observe the state of the rear wiper control and to store it in a message to be sent to the ECU2 through the data bus. Ps2, named "TRANSPORTING", is related to the process occurring in the data bus. Notice that the process "TRANSPORTING a generalised variable" described in [Chittaro et al., 1993] is extended to "TRANSPORTING a message" in the software view. Then, Ps3 is related to the software process occurring in the ECU2.

The model of phenomena

The last model in the functional knowledge level is the model of phenomena. One phenomenon is described by a tuple <organisation, precondition, effect> in which organisation is a process network which defines which processes are necessary and how they must be related together in order to enable the occurrence of the phenomenon. So, a phenomenon is an aggregation of processes organised through causal links.

Two high level phenomena are described for the rear wiper system (Figure 5):

• The phenomenon PH1, named "WIPING", is related to the wiping action of the system (to the state "ON" of the automaton of Figure 3).

• The phenomenon PH2, named "WAITING", is related to the idle state of the system.

As in [Chittaro et al., 1993], the ontological links allow one to describe the links between phenomena and processes.

3.4 The teleological model

The teleology of a system is defined as the specification of the functions as they are intended by its designer. This notion is close to the perception of the behaviour by a human user. The definition of function used in this work is:

Definition 1 A function defines a mapping between a conjunction of conditions on atomic inputs and a given system's state as it is intended by the designer or perceived by the user as output.

A function is commonly represented by a triple <function pattern, operational conditions, effects>³:

• The function pattern assigns a name to the function and specifies its arguments, which are variables relevant to the definition of the goal. "To wipe the rear window with a user activation" identifies one function of the system. The angular position θM of the wiper is the argument.

• The preconditions are the operational conditions which specify what should be provided as input to the system for the achievement of the intended function. There are two operational conditions for the example: "the rear wiper control is activated" (RWC) and "the boot is closed" (In1).

• The effects specify the intended behaviour of the function. For the example, the effect is: "intermittent wiping" mapped to the variable θM.
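The triple above lends itself to a simple achievement test, sketched below under stated assumptions. The function pattern, the operational conditions RWC and In1 and the intended behaviour are taken from the example in the text; the class, the field names and the representation of the observed behaviour as a plain string are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Function:
    pattern: str                             # name of the function
    arguments: Tuple[str, ...]               # variables relevant to the goal
    operational_conditions: Dict[str, bool]  # input name -> required value
    intended_behaviour: str                  # effect expected on the arguments


def achieved(function: Function, inputs: Dict[str, bool],
             observed_behaviour: str) -> Optional[bool]:
    """Return None when the function is not testable, else whether it is achieved."""
    for name, required in function.operational_conditions.items():
        if inputs.get(name) != required:
            return None  # operational conditions not met: no verdict
    return observed_behaviour == function.intended_behaviour


rear_wipe = Function(
    pattern="To wipe the rear window with a user activation",
    arguments=("thetaM",),
    operational_conditions={"RWC": True, "In1": True},
    intended_behaviour="intermittent wiping",
)
```

With both operational conditions satisfied, an observed behaviour other than "intermittent wiping" makes `achieved` return `False`, which corresponds to a missing intended behaviour; when a condition is not met the sketch returns `None` because the function cannot be tested in that situation.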

This representation of function refers to two important notions: on the one hand, it describes a goal in terms of the desired artifact's behaviours and on the other hand it is linked with the notion of testability. The operational conditions and the intended behaviours have to be clearly defined to enable testing the function achievement. This issue is discussed in the next section for the diagnosis task.

³ Some authors use the terms preconditions and effects instead of operational conditions and intended behaviour, respectively. We prefer the latter because these terms better map the physical system to the human designer/user.

3.5 The sequential behaviours

It can be noticed that in Figure 5 the abstraction of the two-state sequence "IDLE" and "ON" up the hierarchy deserves further attention. For this kind of behaviour, classified as "sequential and intermittent behaviour", preconditions and effects may be expressed by using temporal logic operators [Bell and Snooke, 2004].

4 Off-board multimodel based diagnosis

4.1 The off-board diagnosis issue in the automotive domain

Diagnosis is the process of identifying the cause (fault) of a system's malfunction by observing the system at various monitoring (test) points. The number of possible causes of dysfunction has increased with the technological advances of automotive systems, while the reduction in the number of monitoring points results in reduced observability, making it increasingly difficult to troubleshoot vehicles.

The different types of observations

The diagnosis task is driven by the available observations. In the automobile domain, observations are of different types, ranging from functional symptoms reported by the clients to qualitative observations and physical measurements. The observations can be classified as follows:

• A functional symptom relies on a high level observation by providing information about the functions and their failures. More precisely, it refers to a missing intended behaviour of a function. When it is reported by a client to the garage mechanic, it is called "client symptom".

• ECU's data: when the garage mechanic connects a computer to the diagnosis interface of a car, some input/output variables of the ECUs, useful for the diagnosis task, can be accessed. These variables are of two types: physical or logical quantities, or fault codes⁴.

• A physical measurement is an observation at the behavioural knowledge level.

The test sequencing problem

The off-board diagnosis problem, in the automotive domain, is equivalent to a test problem. The diagnosis activity starts with a set of preliminary symptoms gathered by the garage mechanic. These preliminary symptoms are fault codes, client symptoms and other preliminary garage mechanic observations.

Then, the fault isolation problem is defined as the determination of the additional information (obtained by tests) which allows the best discrimination among the diagnostic hypotheses generated with the preliminary symptoms. One test is defined by the variable which has to be observed, the configuration in which the system must be to perform the test and the possible outcomes of the test (generating new symptoms). Some previous works have proposed solutions to diagnose electric circuits in the automotive domain [Faure, 2001][Olive, 2003][Price et al., 1995][Sachenbacher and Struss,].

⁴ Most ECUs are equipped with an auto-diagnosis function which reliably detects which of the electric circuits connected to one ECU are failing. The failed electric circuits are associated with fault codes.

Figure 5: Teleological abstraction of the rear wiper system’s behaviour.

4.2 Perspectives for the diagnosis task

There are few approaches in the literature which use the four epistemological types in a cooperative way for the diagnosis task. Chittaro [Chittaro et al., 1993] has suggested one method for focusing the diagnostic activity. Interpretative knowledge indeed makes it possible to achieve the diagnostic task in a hierarchical way. At the teleological level, client symptoms allow one to identify the functions which undergo failures. By exploiting the bridge between teleology and behaviour, only those parts of the structural and behavioural models which are responsible for the unachievement of the functions need to be considered.

The different steps of the diagnosis process, illustrated in Figure 6, are organised as follows:

• The diagnosis model generation consists in computing the hierarchical models from design data (electrical diagrams, State-Charts, functional descriptions).

• The symptom translation consists in propagating the symptom up and down the different levels of the hierarchy through the different links in order to glean as much information as possible from the symptom description.

• The diagnostic hypothesis generation consists in the isolation of faults in the system. If the symptoms which are already available are not sufficient to correctly identify a unique hypothesis, the reasoning algorithm needs more information.

• The test selection issue depends on the selected diagnosis method. For this task, our objective is to suggest at each step the best next test, i.e., the one for which the associated symptoms result in the maximal information gain.
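The test selection step can be illustrated with a small sketch. The paper does not specify how information gain is computed, so the code below assumes the common entropy-based definition: the gain of a test is the prior entropy over the hypotheses minus the expected posterior entropy over the test outcomes. The function names, the hypothesis names, and the test names are hypothetical and chosen only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, outcome_model):
    """Expected entropy reduction obtained by performing one test.

    prior:         {hypothesis: P(hypothesis)}
    outcome_model: {hypothesis: {outcome: P(outcome | hypothesis)}}
    """
    outcomes = {o for dist in outcome_model.values() for o in dist}
    expected_posterior_entropy = 0.0
    for o in outcomes:
        # Joint probability of each hypothesis together with outcome o.
        joint = {h: prior[h] * outcome_model[h].get(o, 0.0) for h in prior}
        p_o = sum(joint.values())
        if p_o == 0:
            continue
        posterior = [v / p_o for v in joint.values()]
        expected_posterior_entropy += p_o * entropy(posterior)
    return entropy(prior.values()) - expected_posterior_entropy


def best_next_test(prior, tests):
    """Return the name of the test with maximal expected information gain."""
    return max(tests, key=lambda name: expected_information_gain(prior, tests[name]))


# Hypothetical example: three equally likely hypotheses and two candidate tests.
PRIOR = {"fuse": 1 / 3, "motor": 1 / 3, "switch": 1 / 3}
TESTS = {
    # Separates the fuse hypothesis from the two others.
    "check_fuse": {"fuse": {"blown": 1.0}, "motor": {"ok": 1.0}, "switch": {"ok": 1.0}},
    # Gives the same outcome whatever the fault is, so it carries no information.
    "listen_relay": {"fuse": {"click": 1.0}, "motor": {"click": 1.0}, "switch": {"click": 1.0}},
}
```

A cost-aware variant could divide the gain by a test cost, which would favour cheap functional tests over physical measurements; that extension is not shown here.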


Figure 6: The synopsis of the diagnosis strategy

5 Conclusion

The work presented in this paper has pointed out the potential benefits of using a multimodel cooperation for modelling and troubleshooting complex embedded systems.

High level symptoms, like client symptoms in the automotive domain, are directly linked to functions described at the teleological level. Thus, the multimodel hierarchy maps these symptoms to the behaviour of each individual component. Conversely, measured physical values can be abstracted up the hierarchy. All types of available symptoms can hence be used to diagnose the system.

Our aim is to apply the multimodelling framework to the test sequencing problem. Although this issue deserves further investigation, its advantages already appear clearly. Instead of restricting the tests to be selected to one type of test, as in [Faure, 2001][Olive, 2003], all types of tests can now be proposed. The gain might be highly significant: a test at the functional level is generally much cheaper than a physical measurement, which generally requires disassembling mechanical components. The links between each level of the functional hierarchy should allow us to propagate the observations made at a given level up or down the other levels, increasing observability and diagnosability.

References

[Autio and Reiter, 1998] K. Autio and R. Reiter. Structural Abstraction in Model-Based Diagnosis. In 13th European Conference on Artificial Intelligence ECAI-98, pages 269–273, Brighton (UK), 1998.

[Bell and Snooke, 2004] J. Bell and N. Snooke. Describing System Functions that Depend on Intermittent and Sequential Behavior. In 18th International Workshop on Qualitative Reasoning QR’04, Evanston (USA), 2004.

[Bell et al., 2005] J. Bell, N. Snooke, and C. Price. Functional Decomposition for Interpretation of Model-Based Simulation. In 19th International Workshop on Qualitative Reasoning QR’05, Austria, 2005.

[Chandrasekaran, 1994] B. Chandrasekaran. Functional Representation and Causal Processes. Advances in Computers, 38:73–143, 1994.

[Chittaro and Ranon, 2004] L. Chittaro and R. Ranon. Hierarchical Model-Based Diagnosis Based on Structural Abstraction. Advanced Engineering Informatics, 155(1-2):147–182, 2004.

[Chittaro et al., 1993] L. Chittaro, G. Guida, C. Tasso, and E. Toppano. Functional and Teleological knowledge in the multimodeling approach for reasoning about physical systems: A case study in diagnosis. IEEE Transactions on Systems, Man and Cybernetics, 23(6):1718–1751, 1993.

[Faure, 2001] P. P. Faure. An Interval Model-Based Approach for Optimal Diagnosis Tree Generation: Application to the Automotive Domain. PhD thesis, LAAS-CNRS, 2001.

[Iwasaki and Simon, 1994] Y. Iwasaki and H. Simon. Causality and Model Abstraction. Artificial Intelligence, 67(1):143–194, 1994.

[Kitamura et al., 2002] Y. Kitamura, T. Sano, K. Namba, and R. Mizoguchi. A Functional Concept Ontology and Its Application to Automatic Identification of Functional Structures. Advanced Engineering Informatics, 16(2):145–163, 2002.

[Lind, 1982] M. Lind. Multilevel Flow Modelling of Process Plant for Diagnosis and Control. In Proc. International Meeting on Thermal Reactor Safety, pages 1653–1666, Chicago (USA), 1982.

[Modarres and Chehon, 1999] M. Modarres and S.W. Chehon. Function-Centered Modelling of Engineering Systems Using the Goal Tree-Success Tree Technique and Functional Primitives. Reliability Engineering and Systems Safety, 64:181–200, 1999.

[Mozetic, 1991] I. Mozetic. Hierarchical Model-Based Diagnosis. Int. J. of Man-Machine Studies, 35(3):329–362, 1991.

[Olive, 2003] X. Olive. Approche Integree Base de Modeles pour le Diagnostic Hors-Ligne et la Conception: Application au Domaine de l’Automobile. PhD thesis, LAAS-CNRS, 2003.

[Paynter, 1961] H.M. Paynter. Analysis and Design of Engineering Systems. MIT Press, Cambridge, 1961.

[Price and Snooke, 1998] C. Price and N. Snooke. Hierarchical Functional Reasoning. Knowledge Based Systems, 11:301–309, 1998.

[Price et al., 1995] C. Price, D. R. Pugh, M. S. Wilson, and N. Snooke. The Flame system: Automating electrical failure mode effects analysis (FMEA). In Annual Reliability and Maintainability Symposium, pages 90–95, Washington D.C., 1995.

[Sachenbacher and Struss,] M. Sachenbacher and P. Struss. Aqua: A framework for automated qualitative abstraction. In 15th International Workshop on Qualitative Reasoning, QR-01, pages 5–12, San Antonio, Texas (USA).

[Thetiot et al., 1998] R. Thetiot, F. Zouaoui, M. Dumas, P. Dague, and T. Renaud. Automatic construction of processes from bond graph representation. In Proc. of the International Workshop on Qualitative Reasoning QR’98, pages 131–136, Cape Cod (USA), 1998.

[Trave-Massuyes and Pons, 1997] L. Trave-Massuyes and R. Pons. Causal Ordering for Multiple Mode Systems. In Proc. of the 11th International Workshop on Qualitative Reasoning QR’97, pages 203–214, Cortona (Italia), 1997.


A Bayesian Approach to Efficient Diagnosis of Incipient Faults

Indranil Roychoudhury, Gautam Biswas and Xenofon Koutsoukos

Institute for Software Integrated Systems (ISIS)

Department of Electrical Engineering and Computer Science

Vanderbilt University

Nashville, TN 37235, USA

Email: indranil.roychoudhury, gautam.biswas, [email protected]

Abstract

Safe, reliable, and efficient operation of complex dynamical systems requires the ability to detect, isolate, and identify degradation in system components. Degradations are typically modeled as incipient faults, which are slow drifts in system parameters over time. This paper presents an efficient approach for the detection, isolation, and identification of incipient faults under uncertainty using a Dynamic Bayesian Network (DBN) approach. Initially a DBN is used as an observer to track nominal system behavior. Once a fault is detected, incipient fault hypotheses are generated using a variation of our qualitative TRANSCEND approach for abrupt fault isolation. A modified DBN that includes the active fault hypotheses is then used to isolate the true fault and estimate the rate of change in its parameter value.

1 Introduction

Safe, reliable, and efficient operation of complex systems requires the ability to detect, isolate, and identify degradation in system components. Degradations are often modeled as incipient faults, which are slow drifts in system parameter values over time. In our previous work, we have developed fault diagnosis schemes for abrupt faults, which are modeled as instantaneous changes in system parameter values at a point in time. The qualitative fault isolation (QFI) scheme is based on the analysis of transients in the dynamic system behavior [Mosterman and Biswas, 1999; Narasimhan and Biswas, 2006; Roychoudhury et al., 2005; Daigle et al., 2006]. This approach has to be modified to accommodate the temporal profile for incipient faults (see Fig. 1).

This paper presents an efficient approach for the diagnosis of incipient faults by combining a variation of the TRANSCEND qualitative fault isolation approach [Mosterman and Biswas, 1999] with a quantitative fault isolation and identification scheme that employs a Dynamic Bayesian Network (DBN) model of the system dynamics. In general, DBN-based diagnosis approaches for complex systems suffer from computational intractability because of the large number of nodes (i.e., system variables and possible fault hypotheses) that have to be included in the DBN model. In our approach, efficiency is achieved by performing the fault isolation and identification in two steps: (i) run an efficient qualitative fault isolation scheme to reduce the number of candidate hypotheses to a small number, and (ii) run a refined DBN model to uniquely isolate the single fault candidate and estimate the rate of change in its parameter value. The focus of this paper is on fault isolation and identification of incipient faults in continuous dynamic systems. We assume that only single, incipient faults occur in the system. This assumption is required for the qualitative analysis only¹. The quantitative FII framework can handle multiple fault hypotheses.

Figure 1: Incipient Fault Profile (parameter value versus time, showing $p(t)$, $d(t)$, and $p_{IF}(t)$, with $t_0$ and $t_d$ marked on the time axis)

The paper is organized as follows. Section 2 presents a mathematical definition of incipient faults and formulates our approach for solving the incipient fault diagnosis problem. Section 3 presents the incipient fault diagnosis architecture, and gives a brief overview of the fault detection, isolation, and identification subsystems. The different models employed for diagnosis are presented in Section 4. Section 5 explains in more detail the algorithms for incipient fault diagnosis. Section 6 presents results of applying this approach to a two tank system, and conclusions are presented in Section 7.

2 Incipient Fault Diagnosis

A complete incipient fault diagnosis scheme must be tailored for detection, isolation, and identification (FDII) of incipient faults. Like earlier work, our diagnosis approach focuses on parametric component faults. In this framework, the mathematical representation of an incipient fault adds a drift term to the nominal component parameter value.

Definition 1 (Incipient fault) An incipient fault profile in a dynamic system is characterized by a gradual drift in the corresponding component parameter value from the time point of failure occurrence. The temporal profile for an incipient fault in parameter $p$, $p_{IF}(t)$, is given by:

$$p_{IF}(t) = \begin{cases} p(t), & t \le t_0 \\ p(t) + d(t), & t > t_0 \end{cases} \tag{1}$$

where $p(t)$ represents the nominal value of a parameter $p$ over time, and $d(t)$ is the drift in the parameter value that gets added to the parameter value after occurrence of the fault, i.e., for $t > t_0$.

¹ Daigle, Koutsoukos, and Biswas (DX 2006) have developed an extension of the TRANSCEND scheme for multiple fault diagnosis.

Figure 2: The diagnosis architecture (blocks: Plant, Observer with Nominal DBN, Fault Detection, Qualitative Fault Isolation, Quantitative Fault Isolation and Identification with DBN Modeling Faulty Behavior, Bond Graph Plant Model, Temporal Causal Graph; signals: $Y$, $\hat{Y}$, $R_s$, $P$, $\langle p, p_s \rangle$)

Fig. 1 shows an incipient fault profile, with $t_0$ as the time of occurrence of the fault. Since the rate of change of the parameter value is slow compared to the system dynamics, we can approximate the drift term as $d(t) = p_s (t - t_0)$, $t \ge t_0$, where $p_s$ is a constant that defines a linear rate of change, and $t_0$ is the time point at which the incipient fault first occurs.
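As a minimal sketch of Definition 1 with the linear drift approximation, the function below evaluates the faulty parameter value at a given time. It assumes a constant nominal value for simplicity; the function and argument names are illustrative and not taken from the paper.

```python
def incipient_fault_profile(p_nominal, t, t0, ps):
    """Value of a parameter with an incipient fault at time t.

    p_nominal: nominal parameter value (assumed constant here)
    t0:        time of fault occurrence
    ps:        constant drift rate, so that d(t) = ps * (t - t0) for t > t0
    """
    drift = ps * (t - t0) if t > t0 else 0.0
    return p_nominal + drift
```

For example, with a nominal value of 2.0, a fault at time 10 and a drift rate of 0.1 per time unit, the parameter is still 2.0 at time 5 and has grown to about 3.0 at time 20.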

2.1 Detection of Incipient Faults

Fault detection is the first step in any diagnosis process. The observer for tracking nominal behavior is based on a DBN model. This observer-generated expected behavior of the system is compared against the actual measurements using a Z-test for difference in means for robust fault detection [Biswas et al., 2003].

Ideally, deviations in measurements caused by faults and degradations should be detected at or very soon after the point of fault occurrence. In reality, to accommodate measurement noise, inaccuracies in the model, and sensitivity of the detection scheme, one has to trade off false alarm generation against detection delays. Statistical hypothesis testing schemes help reduce the false alarm rate, but introduce a delay between the time of occurrence and detection of faults, i.e., $t_d > t_0$. This detection delay, $t_d - t_0$, may pose convergence problems and reduce the parameter estimation accuracy. In our previous work on qualitative diagnosis [Manders and Biswas, 2003], we have shown that this delay does not affect diagnosis accuracy. In this approach, we assume this delay to be short enough not to affect qualitative diagnosis and the DBN-based estimation schemes. To ensure convergence of the DBN scheme, we start the estimation process from the time point at which the fault was detected.

2.2 Qualitative Fault Isolation

As the first step after fault detection, we employ a qualitative inference procedure using symbolic deviations and qualitative fault signatures for generating and refining possible fault hypotheses. This extends our previous work on transient analysis of abrupt faults [Mosterman and Biswas, 1999]. Unlike abrupt faults, which are modeled as a $\pm$ change in parameter value at the point of fault occurrence, incipient faults, characterized by slow drifts in parameter values (see Definition 1), are modeled qualitatively as $(0, \pm)$ change profiles, i.e., there is no change in the faulty parameter value at the point of fault occurrence but the parameter value slowly increases (decreases) over time. This fault profile matches any drift function $d(t)$ that is monotonic. Given such fault profiles, the TRANSCEND scheme for qualitative hypothesis generation and refinement can be applied for qualitative fault isolation. This methodology is outlined in Section 5.3.

2.3 Quantitative Fault Isolation and Identification (FII) using DBNs

Quantitative FII is the final step in the fault diagnosis procedure. The TRANSCEND scheme discussed in Section 2.2 may not return a unique fault candidate, but it typically reduces the number of fault hypotheses to a tractable number. This makes it feasible to run a quantitative FII procedure using a DBN, outlined in Section 5.4, to refine the candidate set and estimate the drift parameter for the true fault candidate.

3 Architecture for Incipient Fault Diagnosis

The architecture of our model-based diagnosis methodology, presented in Fig. 2, follows a traditional diagnosis scheme for continuous systems. The system, as outlined in Section 2, includes four primary modules: (i) the observer, (ii) the fault detector, (iii) the qualitative fault isolation unit, and (iv) the DBN-based FII unit. We build the dynamic plant model in the bond graph (BG) modeling language [Karnopp et al., 2000] using a methodology where the components of interest in the system can be identified by one or more bond graph parameters, such as source elements, capacitors, inertias, resistances, and transformers. We derive the temporal causal graph (TCG) from the BG plant model using techniques that have been described earlier [Mosterman and Biswas, 1999]. The TCG, which is an extension of signal flow graphs, includes all the system variables as well as the component parameters that define dynamic system behavior. The TCG model is explained in greater detail in Section 4.1.

The observer is constructed as a DBN model of the nominal system. DBN tracking accommodates plant model inaccuracies and noisy measurements. Its inputs are the plant measurements, $Y$. The DBN is derived from the TCG model using the method described in [Lerner et al., 2000], and outlined in Section 4.2. We use standard Bayesian propagation techniques [Russell and Norvig, 1995] to derive estimates of the most likely system state, $X$, and measurement values, $\hat{Y}$, as plant behavior evolves. As discussed earlier, incipient fault parameters change at a very slow rate, which makes the detection of changes due to incipient faults a hard problem, since it becomes difficult to separate the measurement deviations from measurement noise and discrepancies caused by modeling inaccuracies. We employ statistical methods for robust fault detection. The inputs to the Fault Detector are the plant measurements $Y$ and the observer-predicted measurements $\hat{Y}$. A significant difference between the observed and expected behavior, $(Y - \hat{Y})$, signals a fault occurrence, and the qualitative residual signals $R_s$ generated from the point of fault detection $t_d$ are used for hypothesis generation and refinement.

Figure 3: The two tank system and its BG model. (a) The two tank system schematic; (b) The bond graph model of the two tank system.

When the fault detector triggers, the DBN observer is suspended and the TRANSCEND procedure is activated. The qualitative residual signals, $R_s$, are used for initial hypothesis generation and, as additional measurements deviate, for hypothesis refinement using qualitative methods. All measurements from the time point of failure detection are also cached for later use by the FII module. The qualitative scheme is terminated when one of the following conditions becomes true: (i) the number of fault candidates is reduced below a certain number, (ii) all measurement deviations have been used, or (iii) a pre-specified time horizon is exceeded. The DBN-based FII scheme is then initiated with a DBN model of the faulty system behavior from the point of detection of the incipient fault. The set of current fault hypotheses, $P$, is used to extend the nominal DBN to the fault DBN for tracking the system behavior after fault occurrence. Again, standard Bayesian update functions are employed, and with additional measurements the estimates converge to the true observed measurements. At this point, the rate of change of the fault is estimated using least squares estimation techniques. The output from the FII unit is the fault hypothesis and its rate of change, i.e., $\langle p, p_s \rangle$. The steps outlined above are explained in detail in the following sections.
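The final estimation step can be sketched as follows. The paper only states that least squares techniques are used, so this is an assumption-laden illustration: it supposes that estimates of the faulty parameter are available at successive time stamps and fits a straight line to them by ordinary least squares, taking the slope as the drift rate $p_s$. The function name and signature are hypothetical.

```python
def estimate_drift_rate(times, values):
    """Ordinary least-squares line fit: values ~ intercept + slope * times.

    times:  time stamps of the parameter estimates
    values: estimated parameter values at those time stamps
    Returns (slope, intercept); the slope plays the role of the drift rate.
    """
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples and equal-length inputs")
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all time stamps are identical")
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    return slope, intercept
```

With noise-free samples such as values 2.0, 2.5, 3.0, 3.5 at times 0, 1, 2, 3, the fit recovers a slope of 0.5; with noisy estimates the slope is the best linear approximation in the least squares sense.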

We believe that this approach provides an efficient computational scheme for solving the incipient fault diagnosis problem in the presence of measurement noise and model uncertainty. The Z-test-based fault detection module performs quick and reliable incipient fault detection while avoiding false alarms. The isolation and identification process is made computationally simpler by combining the TCG-based qualitative fault isolation and the DBN-based FII procedures. As presented in [Lerner et al., 2000], FDII of incipient faults can be achieved by using a single DBN that models both the nominal as well as all possible faulty behavior of the system. However, this makes the number of possible fault hypotheses very large, and an exhaustive online tracking procedure is not computationally viable. For this reason, the procedure outlined in [Lerner et al., 2000] involves dropping unlikely fault candidates to save on computation. It is, therefore, possible that a true fault is dropped early as its probability of occurrence is very small. Our diagnosis approach retains all possible faults without compromising on efficiency. This is achieved by starting the DBN-based FII procedure only after the TCG-based hypothesis refinement, thereby reducing the number of nodes in the DBN.

Figure 4: The temporal causal graph of the two tank system (vertices: f1 to f8 and e2 to e8; edge labels: $=$, $1$, $-1$, $1/R1$, $1/R12$, $1/R2$, $(1/C1)\,dt$, $(1/C2)\,dt$)

4 Modeling

Any model-based diagnosis approach can only be as good as the models that form the core of the diagnosis methodology. As discussed earlier, component-based BGs form the core of our modeling framework for physical plants. Efficient models for diagnosis, the TCG, state space models, and the DBNs are all derived from the primary BG plant model. This section gives a brief summary of the different models that we employ for incipient fault diagnosis.

4.1 Temporal Causal Graph

A TCG can be described as a diagnosis model that captures dependencies (algebraic and temporal) between system variables as a causal structure. The TCG is derived directly from the bond graph model of the plant [Mosterman and Biswas, 1999]. The TCG derived from the BG model can be defined as follows.

Definition 2 (Temporal Causal Graph (TCG)) A TCG is a directed graph $\langle V, L, D \rangle$. $V = E \cup F$, where $V$ is a set of vertices, $E$ is a set of effort variables and $F$ is a set of flow variables in the bond graph system model. $L$ is the label set $\{=, 1, -1, p, p^{-1}, p\,dt, p^{-1}dt\}$ ($p$ is a parameter name of the physical system model). The $dt$ specifier indicates a temporal edge relation, which implies that a vertex affects the derivative of its successor vertex across the temporal edge. $D \subseteq V \times L \times V$ is a set of edges [Narasimhan and Biswas, 2006].
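Definition 2 maps naturally onto a small labelled directed graph structure. The sketch below is only an illustration of that definition: the class and method names are hypothetical, labels are stored as plain strings, and the example edges are a plausible fragment for tank 1 assumed from the bond graph description, not a transcription of the TCG figure.

```python
from collections import defaultdict


class TemporalCausalGraph:
    """Directed graph <V, L, D> whose edges are (source, label, target) triples."""

    def __init__(self):
        self.vertices = set()                  # V: effort and flow variables
        self.edges = set()                     # D: subset of V x L x V
        self._successors = defaultdict(list)

    def add_edge(self, source, label, target):
        self.vertices.update((source, target))
        if (source, label, target) not in self.edges:
            self.edges.add((source, label, target))
            self._successors[source].append((label, target))

    def successors(self, vertex):
        """Return the (label, target) pairs of the edges leaving a vertex."""
        return list(self._successors[vertex])

    @staticmethod
    def is_temporal(label):
        """A label carrying the dt specifier denotes a temporal (integrating) edge."""
        return label.endswith("dt")


# Assumed fragment around tank 1 (illustrative only).
tcg = TemporalCausalGraph()
tcg.add_edge("f1", "1", "f2")          # inflow adds to the net flow into tank 1
tcg.add_edge("f2", "(1/C1)dt", "e2")   # net flow integrates into tank pressure
tcg.add_edge("e2", "=", "e3")          # common effort at the junction
tcg.add_edge("e3", "1/R1", "f3")       # pressure drives outflow through pipe R1
tcg.add_edge("f3", "-1", "f2")         # outflow reduces the net flow into tank 1
```

Storing the label on each edge is what later allows qualitative propagation to distinguish immediate (algebraic) effects from delayed (temporal) ones.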

Fig. 3(a) shows the schematic of a two tank system that we will use as an example in this paper. The system comprises two interconnected tanks, each having an outflow pipe for draining the tank. The first tank also has a source of flow for filling the tank. Fig. 3(b) shows the bond graph model. Bonds drawn as half-arrows capture the energy-exchange pathways in the system. Pipes are modeled as resistances and the tanks are modeled as capacitances. Pipes R1 and R2 drain tanks C1 and C2, respectively, and pipe R12 connects the two tanks C1 and C2. Fig. 4 shows the TCG for the two tank system. Temporal relations in the TCG are associated with the energy storage elements, i.e., the tanks. All other relations in the TCG, e.g., the pressure-flow relations imposed by the pipes and the idealized junction relations, are algebraic.

4.2 The DBN Observer for the Nominal System

The DBN observer for the nominal system is constructed from the TCG, as outlined in [Lerner et al., 2000]. The DBN model is made up of two components:

1. A regular Bayes net that captures the relations between system variables at any time slice $t$. This consists of four sets of variables $(X_t, Z_t, U_t, Y_t)$, which represent the state variables, other hidden variables, input variables, and measured variables for the dynamic system, and

2. A two-slice temporal Bayes net that captures the across-time relations defined by the state equation model of the dynamic system. We assume that the state equation model is a discrete-time stochastic process that satisfies the first order Markov assumption. Therefore, the across-time links between time slices $t$ and $t+1$ are defined by the system state equations.

For the two tank system, the DBN derived from the TCG has the following variables at time $t$: $X_t = \{e2_t, e7_t\}$, the pressures at the bottom of tanks 1 and 2, respectively; $U_t = \{f1_t\}$, the flow into tank 1; and $Y_t = \{f3_t, f8_t, f5_t\}$, the outflows from tanks 1 and 2, respectively, and the flow between tanks 1 and 2. $Z_t = \emptyset$, i.e., the two tank dynamic model requires no additional variables. The across-time model includes five links: $e2_t \to e2_{t+1}$, $e7_t \to e7_{t+1}$, $e2_t \to e7_{t+1}$, $e7_t \to e2_{t+1}$, and $f1_t \to e2_{t+1}$. These links are directly derived from the state space model of the system. Fig. 5(a) shows the DBN observer for time steps $t$ and $t+1$.
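The across-time links can be read off a discrete-time state equation for the two tanks. The sketch below is an assumption-based illustration, not the authors' model: it takes linear pipes and tanks, a forward Euler discretization with step `dt`, unit default parameter values, and it omits the process and measurement noise that the DBN would add. The function name is hypothetical.

```python
def two_tank_step(e2, e7, f1, dt, C1=1.0, C2=1.0, R1=1.0, R2=1.0, R12=1.0):
    """One forward-Euler step of an assumed linear two tank model.

    e2, e7: pressures at the bottom of tanks 1 and 2 (state)
    f1:     inflow into tank 1 (input)
    Returns the next state and the three flows used as measurements.
    """
    f3 = e2 / R1            # outflow of tank 1 through pipe R1
    f8 = e7 / R2            # outflow of tank 2 through pipe R2
    f5 = (e2 - e7) / R12    # flow from tank 1 to tank 2 through pipe R12
    e2_next = e2 + dt * (f1 - f3 - f5) / C1   # depends on e2, e7 and f1
    e7_next = e7 + dt * (f5 - f8) / C2        # depends on e2 and e7
    return (e2_next, e7_next), (f3, f8, f5)
```

The next value of e2 depends on e2, e7 and f1, and the next value of e7 depends on e2 and e7, which gives exactly five across-time dependencies.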

4.3 The DBN Diagnoser

Model-based diagnosis schemes require the models to represent both the nominal and faulty system behavior. The DBN observer derived from the system TCG model, shown in Fig. 5(a), represents a stochastic model of nominal system behavior. Tracking of faulty behavior requires a stochastic model that captures incipient fault effects. The procedure for deriving this DBN is also detailed in [Lerner et al., 2000]. To capture faulty system behavior, two sets of nodes are added. The first set corresponds to parameters that represent the incipient fault hypotheses. The second set consists of discrete-valued nodes that are in 1-1 correspondence with the fault parameters, and they indicate the absence or presence of an incipient fault for that parameter. Fig. 5(b) shows the DBN diagnoser for faulty behavior of the two tank system, assuming two potential fault hypotheses, $\{R2, R12\}$. In other words, the DBN for faulty behavior now has an extended set $X_t$ that includes $\{D2_t, D12_t\}$ in addition to $\{e2_t, e7_t\}$. The $D$’s are logical variables. A value of 1 implies that the linked parameter has an incipient fault. A value of 0 implies no fault. This introduces additional across-time links, $D2_t \to D2_{t+1}$ and $D12_t \to D12_{t+1}$. In addition, $Z_t = \{R2_t, R12_t\}$. The set of possible fault hypotheses covered by this DBN model includes: (i) neither R2 nor R12 faulty, (ii) R2 faulty, R12 not faulty, (iii) R2 not faulty, R12 faulty, and (iv) R2 and R12 faulty.
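The four cases listed above are simply all joint assignments of the binary indicator nodes. A short sketch, with a hypothetical function name, makes the count explicit and shows how quickly it grows with the number of fault parameters:

```python
from itertools import product


def enumerate_fault_modes(fault_parameters):
    """Yield every assignment of the indicators D (0 = no fault, 1 = incipient fault)."""
    for bits in product((0, 1), repeat=len(fault_parameters)):
        yield dict(zip(fault_parameters, bits))


two_candidates = list(enumerate_fault_modes(["R2", "R12"]))
five_candidates = list(enumerate_fault_modes(["R1", "R2", "R12", "C1", "C2"]))
```

Two candidate parameters give 4 joint modes, while five already give 32, which is why reducing the candidate set before building the diagnoser matters.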

The DBN diagnoser model proposed in [Lerner et al., 2000] includes all possible faults in the system. However, the number of possible faults can be very large in complex systems, causing complexity issues in tracking diagnostic behavior using a Bayesian approach. In our work, we reduce the set of possible fault hypotheses using the TRANSCEND scheme, and the DBN model for FII only deals with the fault candidates that are still active when the qualitative scheme terminates. This reduces the size of the DBN diagnoser and results in a considerable improvement in the efficiency of the diagnosis.

5 Fault Detection, Isolation and Identification of Incipient Faults

This section presents the details of our methodology for implementing the different components of the incipient fault diagnosis scheme.

5.1 Tracking Nominal Behavior Using a DBN

The DBN observer captures the nominal state of the system at every time step $t$. The set of nodes $N_t$ in the DBN and their distributions provide a snapshot of the system state. A subset of these nodes, $Y_t$, corresponds to measured variables in the system. The remaining variables belong to the set of system variables that cannot be measured, i.e., $X_t$ and $Z_t$. Without loss of generality, we simplify the subsequent discussion by considering only the variable set $X_t$ and ignoring $Z_t$. The tracking problem for the system observer can be defined as deriving the posterior probability $P(X_t \mid Y_{0:t})$ at every time step $t$.

The first order Markov assumption reduces the computation of the posterior probability to

$$P(X_t \mid X_{0:t-1}) = P(X_t \mid X_{t-1}). \tag{2}$$

Moreover, the state space model of a physical system defines the system output (i.e., the measured variables) as a function of the state and the input variables. This implies

$$P(Y_t \mid X_{0:t}, Y_{0:t-1}) = P(Y_t \mid X_t). \tag{3}$$

By combining equations (2) and (3), the tracking problem can be defined as an iterative problem [Russell and Norvig, 1995]:

$$P(X_{t+1} \mid Y_{0:t+1}) = \alpha \, P(Y_{t+1} \mid X_{t+1}) \sum_{X_t} P(X_{t+1} \mid X_t) \, P(X_t \mid Y_{0:t}),$$

where $\alpha$ is the normalizing constant. In this work, we assume that all random variables in the system are sampled from normal distributions. The noise models for the measurements are also assumed to be Gaussian with zero mean (white noise). Therefore, given prior probability distributions and the measurement noise models, the posterior probability computations are reduced to estimating the means and variances of the posterior Gaussian distributions.

Figure 5: The Nominal and Fault DBN Models for the two tank system. (a) The DBN Observer, with nodes f1, f3, e2, e7, f5, f8 at time slices $t$ and $t+1$; (b) The DBN Diagnoser, which adds nodes representing the faulty parameters (R12, R2) and discrete nodes indicating the presence or absence of a fault (D12, D2).
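The recursive update can be made concrete with a discrete-state sketch. This is an illustration of the recursion only: the paper works with Gaussian variables, for which the sum becomes an integral and only means and variances are propagated, whereas the code below uses a finite state set so that every term of the formula is visible. The function name, state names, and probabilities are hypothetical.

```python
def bayes_filter_step(belief, transition, likelihood):
    """One step of the recursive tracking update for a finite state set.

    belief:     {x: P(X_t = x | Y_0:t)}
    transition: {x: {x_next: P(X_t+1 = x_next | X_t = x)}}
    likelihood: {x_next: P(Y_t+1 = observed value | X_t+1 = x_next)}
    Returns     {x_next: P(X_t+1 = x_next | Y_0:t+1)}
    """
    # Prediction: sum over X_t of P(X_t+1 | X_t) * P(X_t | Y_0:t).
    predicted = {}
    for x, p_x in belief.items():
        for x_next, p_trans in transition[x].items():
            predicted[x_next] = predicted.get(x_next, 0.0) + p_trans * p_x
    # Correction: weight by the measurement likelihood, then normalize (alpha).
    unnormalized = {x: likelihood.get(x, 0.0) * p for x, p in predicted.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return {x: p / total for x, p in unnormalized.items()}


# Hypothetical two-state example.
BELIEF = {"low": 0.5, "high": 0.5}
TRANSITION = {"low": {"low": 0.9, "high": 0.1}, "high": {"low": 0.2, "high": 0.8}}
LIKELIHOOD = {"low": 0.1, "high": 0.7}
POSTERIOR = bayes_filter_step(BELIEF, TRANSITION, LIKELIHOOD)
```

In this example the predicted belief is 0.55 for "low" and 0.45 for "high"; after weighting by the likelihoods and normalizing, most of the probability mass moves to "high".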

The dependencies between the system variables may be non-linear, as is usually the case for real-life systems. As a simplification, tracking of the DBN model can be implemented as an Extended Kalman Filter (EKF) [Bar-Shalom and Fortmann, 1988], which is a classical approach for solving the tracking problem in such systems. The EKF approximates the nonlinear dynamics with linear dynamics and then uses the standard Gaussian model to update the system variables at the next step. We adapt the EKF method [Narasimhan and Biswas, 2006] for tracking the nominal system behavior.

5.2 Incipient Fault Detection

The fault detector continually monitors the measurement residual, $r_t = y_t - \hat{y}_t$, where $y_t \in Y_t$ are the measured variables at time $t$, and $\hat{y}_t$ are the expected values of the measurements as determined by the DBN observer. Ideally, $r_t \neq 0$ should imply a fault and trigger the fault isolation scheme, but to accommodate measurement noise and modeling errors we set up a statistical testing scheme to balance detection sensitivity against false alarms.

We start by defining a signal deviation at time step $t$ in terms of an average residual for the last $N_2$ samples, i.e.,

$$\mu_{N_2,t} = \frac{1}{N_2} \sum_{i=t-N_2+1}^{t} r_i.$$

A hypothesis testing scheme based on the Z-test is employed to establish the significance of the deviation. To perform the Z-test, the variance of the measurement residual must be known. (For unknown variance the T-test may be performed, but its confidence interval is much larger.) To approximate the conditions necessary for the Z-test, the variance of the signal is estimated, but from a larger data set containing $N_1$ samples, i.e., $N_1 \gg N_2$:

$$\sigma^2_{N_1,t} = \frac{1}{N_1 - 1} \sum_{i=t-N_1+1}^{t} \left(r_i - \mu_{N_1,t}\right)^2.$$

The $Z$-value has a distribution $N(0,1)$:

$$Z = \frac{\mu}{\sigma / \sqrt{N_2}}. \tag{4}$$

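The confidence level, defined by $\alpha$, defines the bound $[z^-, z^+]$: $P(z^- < z < z^+) = 1 - \alpha$. This bound can be transformed to another bound $[\mu^-, \mu^+]$ using Eqn. (4) and the approximation $\sigma = \sigma_{N_1}$:

$$\mu^- = z^- \frac{\sigma}{\sqrt{N_2}}, \qquad \mu^+ = z^+ \frac{\sigma}{\sqrt{N_2}}.$$

The Z-test is employed in the following manner:

$$\mu^- \le \mu \le \mu^+ \;\Rightarrow\; \text{no fault}, \qquad \text{otherwise} \;\Rightarrow\; \text{fault}.$$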

The advantage of this fault detection approach is that it is computationally simple, and it makes no assumptions concerning the properties of the changed mean value (it does not have to be constant). Once the fault is detected, the Z-test symbolically outputs the direction of change of the observation, based on the value of the mean. If the mean is negative, this implies that the measurements have decreased from their nominal values, and the symbol $-$ is output. If the mean is positive, the observations have increased from the nominal values, and the symbol $+$ is output.
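The detection scheme of this section can be sketched in a few lines. The code is a minimal illustration under stated assumptions: a symmetric two-sided bound ($z^- = -z^+$), a single residual signal held in a list, and the variance estimated from the last `n1` samples as described above. The function name and default confidence level are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_fault_detector(residuals, n1, n2, alpha=0.05):
    """Z-test on the mean of the last n2 residuals.

    Returns None when no fault is detected, otherwise '+' or '-' for the
    direction of the deviation. The variance is estimated from the last
    n1 samples, with n1 much larger than n2.
    """
    if not (n1 > n2 >= 1) or len(residuals) < n1:
        raise ValueError("need n1 > n2 >= 1 and at least n1 residual samples")
    long_window = residuals[-n1:]
    short_window = residuals[-n2:]
    mu_n1 = sum(long_window) / n1
    variance = sum((r - mu_n1) ** 2 for r in long_window) / (n1 - 1)
    sigma = math.sqrt(variance)
    mu_n2 = sum(short_window) / n2
    # Symmetric two-sided bound: z+ is the (1 - alpha/2) standard normal quantile.
    z_plus = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    bound = z_plus * sigma / math.sqrt(n2)
    if -bound <= mu_n2 <= bound:
        return None
    return "+" if mu_n2 > 0 else "-"
```

In use, the detector would be called at every time step on the residual history of each measurement, and the returned symbol would be passed on as the qualitative deviation of that measurement.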

5.3 Qualitative Incipient Fault Isolation

After fault detection, the DBN tracking is suspended and the TRANSCEND fault isolation scheme [Mosterman and Biswas, 1999] is run on the TCG to generate the initial fault hypotheses given the first non-zero residual symbol(s). The TRANSCEND diagnostic framework for abrupt faults is extended to incipient fault analysis by considering fault profiles that have the value $(0, \pm)$, as was discussed in Section 2. The backward propagation scheme for generating the initial fault hypotheses remains the same.

For each fault hypothesis generated, a forward pass on the TCG, i.e., the forward propagation algorithm, generates the fault signatures. Propagation of a (0,+) or a (0,−) will produce no discontinuous changes in the measured variables. Therefore, the predicted first effect of an incipient fault on a measurement can be expressed as one of three qualitative symbols: +, 0, −, which correspond to a predicted gradual deviation above normal, no change, and a gradual deviation below normal, respectively, over some time interval. In [Manders et al., 2000] we have established that only the first change in a measured signal provides information to differentiate among fault hypotheses; therefore, it is sufficient to record just this first change, ±, as the fault signature. The fault signatures for buildup of sediments in the three pipes of the two tank system (Fig. 3), causing their resistances to increase, are listed in Table 1. Continued monitoring of the remaining measurement deviations helps refine the fault hypotheses using a matching process. If the observed deviation signal matches the predicted signature value, the fault hypothesis is retained; otherwise it is dropped.

Fault   e2   e7   f3   f5   f8
R1+     +    +    −    +    +
R2+     +    +    +    −    −
R12+    +    −    +    −    −

Table 1: Fault Signature Matrix
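The matching process can be illustrated with a short Python sketch built on the signatures of Table 1. The function names `refine` and `isolate` are hypothetical, and the sketch counts observed deviations rather than time steps, which is a simplification of the s-step limit used by the isolation algorithm.

```python
# Qualitative signatures from Table 1 (first predicted change per variable).
SIGNATURES = {
    "R1+":  {"e2": "+", "e7": "+", "f3": "-", "f5": "+", "f8": "+"},
    "R2+":  {"e2": "+", "e7": "+", "f3": "+", "f5": "-", "f8": "-"},
    "R12+": {"e2": "+", "e7": "-", "f3": "+", "f5": "-", "f8": "-"},
}


def refine(candidates, measurement, symbol):
    """Keep the candidates whose predicted symbol matches the observed one."""
    return [c for c in candidates if SIGNATURES[c][measurement] == symbol]


def isolate(observations, max_observations):
    """Process observed (measurement, symbol) pairs; stop early if unique.

    max_observations is a stand-in for the step limit s (an assumption:
    here it bounds the number of deviations, not the number of time steps).
    """
    candidates = list(SIGNATURES)
    for measurement, symbol in observations[:max_observations]:
        candidates = refine(candidates, measurement, symbol)
        if len(candidates) <= 1:
            break
    return candidates
```

With only the flows f3, f5 and f8 measured, the deviations f5 = −, f3 = +, f8 = − leave both R2+ and R12+ as candidates, because their signatures differ only in e7.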

The qualitative fault isolation algorithm is designed to run for at most s steps, where s is a pre-specified value. It may turn out that a single fault is isolated before the s steps are complete, or multiple hypotheses may still be valid after the s steps. When qualitative isolation identifies a unique candidate or the s steps are completed, the TCG based scheme is terminated and the FII module with the DBN diagnoser is initiated. The number of steps s must be carefully chosen. If s is too small, it is very likely that few fault candidates will be dropped and the ensuing DBN-based FII procedure will not be efficient. On the other hand, if s is large we may delay the isolation and identification tasks. A small number of remaining fault candidates implies that only a few “fault nodes” have to be introduced into the DBN diagnoser. This is good because the DBN approach is exponential in the number of fault hypotheses that are introduced. Too many hypotheses increase computation time and also the time to convergence.

5.4 Fault Isolation and Identification of Incipient Faults Using the DBN Diagnoser

Once the TCG based procedure completes running for s steps (or less than s steps if fault isolation completes earlier), the DBN diagnoser is modified to model the remaining fault hypotheses and the DBN-based FII scheme is initiated.

We implement a single DBN that includes all of the current fault hypotheses, i.e., the fault hypotheses that are not eliminated by the TRANSCEND analysis. Consider a specific scenario, where the TRANSCEND scheme reduces the fault hypothesis set to {R2, R12}. As discussed, this introduces four additional nodes into the system DBN, i.e., R2, R12, D2, and D12. The set of possible fault hypotheses covered by the DBN model of the faulty system includes: (i) neither R2 nor R12 faulty, (ii) R2 faulty, R12 not faulty, (iii) R2 not faulty, R12 faulty, and (iv) R2 and R12 faulty. We assume that we have enough measurements such that the system, even with the addition of the faulty modes, is observable. The DBN FII scheme is initialized to the state of the system at time td, when the fault was detected (see Section 2.1). This is because td − to is assumed to be small and the error in starting the DBN-based FII scheme at td instead of to is negligible for our diagnosis approach. Recall that all observations have been cached from the time a fault was detected and the DBN observer was suspended.

For the quantitative FII procedure, we adopt the procedure detailed in [Lerner et al., 2000]. However, the computational complexity of our approach is greatly reduced because we start with the pruned set of fault hypotheses obtained from the qualitative TCG analysis. We maintain the belief state as a set of hypotheses, each of which corresponds to a single multivariate Gaussian distribution. A random variable pt is introduced for each hypothesis (each hypothesis is defined as a parameter value that has changed), and the distribution of pt corresponds to the likelihood for that fault hypothesis. Once the DBN with fault hypotheses is established, the same procedure for updating the likelihood for the nominal DBN can be applied to adjust the weights and the parameters of the multivariate Gaussians as each hypothesis is conditioned on the new measurements Yt+1.
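As a rough illustration of how hypothesis weights can be adjusted when a new measurement arrives, the sketch below applies a plain Bayes update with one Gaussian per hypothesis. It is a scalar simplification, not the procedure of [Lerner et al., 2000]: the names `gaussian_pdf` and `reweight`, and the use of a single scalar measurement instead of the full vector Yt+1, are assumptions made for brevity.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a scalar normal distribution N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def reweight(weights, predictions, measurement):
    """Bayes update of hypothesis weights for one scalar measurement.

    weights: {hypothesis: prior weight}.
    predictions: {hypothesis: (predicted mean, predicted variance)}.
    Returns the normalized posterior weights.
    """
    unnormalized = {
        h: w * gaussian_pdf(measurement, *predictions[h])
        for h, w in weights.items()
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("measurement has zero likelihood under all hypotheses")
    return {h: v / total for h, v in unnormalized.items()}
```

A hypothesis whose predicted measurement stays far from the observed value loses weight quickly, which is the behavior the text relies on to separate the true fault from the others.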

As more observations are collected, the mean value for the true fault parameter changes gradually, whereas the means of the other non-faulty parameters do not change. Moreover, the variances of each distribution should gradually decrease as more measurements are obtained. Observing the sequence of means, we can calculate the rate of change of the true fault parameter, thereby fulfilling the identification task for incipient faults. If at the end of the qualitative analysis, the set of fault hypotheses is refined to a singleton set containing only one fault, it implies that the system is diagnosable using the qualitative diagnoser. In that case, we add only one fault mode to the DBN-based diagnoser and the diagnoser is used solely for estimating the slope of the fault parameter.

6 Results

In this section, we present the results obtained by applying the proposed diagnosis approach to the two tank system shown in Fig. 3(a). In such hydraulic systems, the accumulation of sediment in the pipes is a common example of an incipient fault. These incipient faults are modeled as a gradual increase in the pipe resistances and represented as R1+, R2+ and R12+. The flows f3, f5, and f8 through the pipes R1, R12 and R2, respectively, are the measured variables for this experiment.

System behavior was generated for a total of 500 time steps by simulation using the Simulink/MATLAB environment. White noise (mean = 0, variance = 2% of the measured signal) was added to the measurements. The measurements were saved in a file, and then run through our incipient fault diagnosis scheme (implemented in MATLAB) to generate our experimental results.

We now describe a run of our diagnosis approach for a specific fault scenario. An incipient fault, i.e., a gradual buildup of resistance, was introduced in pipe R12 at time step t = 200. The fault was modeled by a linear increase in the R12 parameter at a rate of 0.0014 per time unit.

The introduction of the fault R12+ first resulted in a decrease from nominal for f5, i.e., f5 = −. The fault detector Z-test signaled this deviation at time step t = 219, and then detected an increase from nominal in the measured value for f3, i.e., f3 = +, and then a decrease from nominal for f8, i.e., f8 = −, at time steps 266 and 371, respectively. This is shown in Fig. 6, where the flows after the introduction of the fault are compared with the flow values estimated by the observer. The forward propagation along the TCG implicated R2+ and R12+ as the possible fault candidates. The fault signatures shown in Table 1 were used to match against the symbolic value of the measured variables. In this particular experiment, at the end of the TCG based analysis, R2+ and R12+ remained as fault candidates, as the deviations observed in f3 and f5 could not refute the possibility of either fault.

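The DBN-based diagnoser, representing the fault modes R2+ and R12+, was appropriately initialized and restarted from the time of detection of the fault, i.e., t = 219. All random variables in the DBN are assumed to be sampled from normal distributions with mean µp and variance σp. The mean of every parameter is updated across time steps as follows:

\[
\begin{bmatrix} \mu^{e_2}_{t+1} \\ \mu^{e_7}_{t+1} \end{bmatrix}
=
\begin{bmatrix}
1 - \frac{1}{C_1}\left(\frac{1}{R_1} + \frac{1}{R_{12}}\right) & \frac{1}{C_1 R_{12}} \\
\frac{1}{C_2 R_{12}} & 1 - \frac{1}{C_2}\left(\frac{1}{R_2} + \frac{1}{R_{12}}\right)
\end{bmatrix}
\begin{bmatrix} \mu^{e_2}_{t} \\ \mu^{e_7}_{t} \end{bmatrix}
+
\begin{bmatrix} \frac{1}{C_1} \\ 0 \end{bmatrix} Sf_t
\]

\[
\begin{bmatrix} \mu^{f_3}_{t+1} \\ \mu^{f_5}_{t+1} \\ \mu^{f_8}_{t+1} \end{bmatrix}
=
\begin{bmatrix}
\frac{1}{R_1} & 0 \\
\frac{1}{R_{12}} & -\frac{1}{R_{12}} \\
0 & \frac{1}{R_2}
\end{bmatrix}
\begin{bmatrix} \mu^{e_2}_{t} \\ \mu^{e_7}_{t} \end{bmatrix}
\]

\[
\begin{bmatrix} \mu^{R_{12}}_{t+1} \\ \mu^{R_2}_{t+1} \end{bmatrix}
=
\begin{bmatrix}
\dfrac{\mu^{e_2}_{t+1} - \mu^{e_7}_{t+1}}{\mu^{f_5}_{t+1}} \\[2ex]
\dfrac{\mu^{e_7}_{t+1}}{\mu^{f_8}_{t+1}}
\end{bmatrix}
\]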
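The mean update can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the function names `step_means` and `estimate_resistances` are hypothetical, the parameter values in the usage note are made up so that the system sits at a steady state, and the variance update and the conditioning on measurements are omitted.

```python
def step_means(e2, e7, sf, r1, r2, r12, c1, c2):
    """One step of the mean update for the two-tank model.

    e2, e7: pressure means at time t; sf: input flow Sf_t.
    Returns (e2_next, e7_next, f3, f5, f8).
    """
    # Pressure means at time t+1 (state update).
    e2_next = (1 - (1 / c1) * (1 / r1 + 1 / r12)) * e2 + e7 / (c1 * r12) + sf / c1
    e7_next = e2 / (c2 * r12) + (1 - (1 / c2) * (1 / r2 + 1 / r12)) * e7
    # Predicted flow means, computed from the pressures at time t.
    f3 = e2 / r1
    f5 = (e2 - e7) / r12
    f8 = e7 / r2
    return e2_next, e7_next, f3, f5, f8


def estimate_resistances(e2, e7, f5, f8):
    """Resistance means recovered from pressure and flow means."""
    r12 = (e2 - e7) / f5
    r2 = e7 / f8
    return r12, r2
```

For example, with illustrative values R1 = R2 = R12 = 1, C1 = C2 = 10 and Sf = 3, the pressures e2 = 2 and e7 = 1 form a steady state: `step_means(2.0, 1.0, 3.0, 1.0, 1.0, 1.0, 10.0, 10.0)` leaves them unchanged, and `estimate_resistances` recovers R12 = R2 = 1 from the resulting flows.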

At every step, the mean and variance of the distribution of each parameter are updated and the estimated observations are compared with the actual faulty behavior. As the estimates are conditioned on more evidence, i.e., measurements, the estimation of the true fault parameter should result in predicted behavior models that match the measured system variables, while the “other” hypotheses will produce estimates that imply no change in their parameter values, or an estimated change that has a very low likelihood given the measurements. The Z-test described earlier is applied to the measured flow estimates corresponding to each of the four hypotheses to determine if there is a significant deviation from the observed faulty measurements. If the Z-test determines a deviation in the residual for a certain hypothesis, that particular hypothesis is no longer considered to be valid.

In this way, at t = 477, the deviation in estimates for R2+ is established using the Z-test, and R12+ is correctly isolated as the true fault. The means of the distribution for R12 at each time step from t = 219 are logged and, using standard least squares estimation, the slope of change is identified. The rate of change of the faulty parameter was identified to be 0.00138, which is close to the actual injected rate of 0.0014, with a percentage error of 1.43%.
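The slope identification and the reported error can be checked with a small sketch. The helper names `least_squares_slope` and `percentage_error` are hypothetical; the logged means themselves are not available here, so only the arithmetic on the reported rates is reproduced.

```python
def least_squares_slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    t_bar = sum(times) / n
    v_bar = sum(values) / n
    num = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


def percentage_error(estimate, actual):
    """Absolute relative error, in percent."""
    return abs(estimate - actual) / abs(actual) * 100.0
```

With the reported values, `percentage_error(0.00138, 0.0014)` evaluates to about 1.43, in agreement with the figure quoted above.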

Fig. 6(a) shows the plots for (i) the nominal flow f3 estimated by the observer, (ii) the measured actual flow f3 with the fault injected at t = 200, (iii) the estimated flow f3 with R2+ as the only fault hypothesis, and (iv) the estimated flow f3 with R12+ as the only fault hypothesis. As the true fault is R12+, we can see that the estimated flow f3 with R12+ as the only hypothesis converges to the observed flow, whereas the estimates of f3 with R2+ as the only hypothesis do not. Thus the Z-test detects a deviation for R2+ and hence it is dropped as the fault hypothesis, isolating R12+ as the true fault. Similar plots for the flows f5 and f8 are shown in Fig. 6(b) and Fig. 6(c). Table 2 summarizes the results for experiments where R1+ and R2+ are introduced as faults one by one.

Fault   Rate of   Time of fault   Time of fault   Time of   Time of DBN-   Estimated
        fault     injection       detection       QFI       based FII      rate of fault
R1+     0.0021    200             205             305       305            0.0022
R2+     0.0022    200             218             343       449            0.0024
R12+    0.0014    200             219             371       477            0.00138

Table 2: Experimental results (all times are expressed as time steps from the start of the experiment)

7 Conclusions

In this paper, we presented an efficient approach for diagnosis of incipient faults using a combined qualitative and quantitative DBN-based estimation scheme. The DBN-based FII approach allows for robust diagnosis under uncertainty that can be attributed to measurement noise and modeling errors. However, for large practical systems, the DBN based approach becomes computationally very expensive. To address this issue, in our approach, the set of fault hypotheses is first refined to a smaller set of candidates using qualitative fault isolation approaches. The DBN is then built for this reduced number of fault hypotheses alone, making it more efficient than one which contains all possible fault hypotheses.

One issue that needs further investigation is the observability of the DBN diagnoser and its impact on diagnosis. For example, in the two tank system shown in Fig. 3, it is sufficient to measure the pressure e7 and the flow f3 to uniquely isolate the fault hypotheses (Table 1). However, for quantitative FII, it will be necessary to measure all three flows, f3, f5, and f8, in order to estimate the appropriate resistance values at each time step. The problem of identifying the correct set of measurements such that the system is diagnosable and the DBN is observable is, therefore, an interesting research issue.

In our experiments, we assumed that the prior and conditional probabilities for the DBN are all Gaussian. Moreover, the parameters of the DBN were also assumed to have a Gaussian distribution. However, this is a rather strong assumption and we need to relax it and demonstrate the efficiency of our diagnosis scheme for more general systems.

Finally, even though the qualitative fault isolation procedure is designed for diagnosis of single faults, the DBN based FII approach has no such restrictions. Hence, a natural extension of this work would be to adapt it for the detection of multiple incipient faults. In future, we intend to also extend this Bayesian approach to the diagnosis of both incipient and abrupt faults.

Acknowledgement

This work was supported in part by NSF CNS-0452067 and NSF CNS-0347440.


References

[Bar-Shalom and Fortmann, 1988] Y. Bar-Shalom and T. E. Fortmann. Tracking and Data Association. Academic Press, 1988.

[Biswas et al., 2003] G. Biswas, G. Simon, N. Mahadevan, S. Narasimhan, J. Ramirez, and G. Karsai. A robust method for hybrid diagnosis of complex systems. In Proc. 5th IFAC Symp on Fault Detection Supervision Safety Technical Processes, pages 1125–1131, Washington, DC, June 2003.

[Daigle et al., 2006] M. Daigle, X. Koutsoukos, and G. Biswas. Distributed diagnosis of coupled mobile robots. In Proc. of 2006 IEEE International Conference on Robotics and Automation, May 2006. To appear.

[Karnopp et al., 2000] D. C. Karnopp, D. L. Margolis, and R. C. Rosenberg. Systems Dynamics: Modeling and Simulation of Mechatronic Systems. John Wiley & Sons, Inc., New York, NY, USA, 3rd edition, 2000.

[Lerner et al., 2000] U. Lerner, R. Parr, D. Koller, and G. Biswas. Bayesian fault detection and diagnosis in dynamic systems. In Proc. of Seventeenth National Conference on Artificial Intelligence, pages 531–537, 2000.

[Manders and Biswas, 2003] E.-J. Manders and G. Biswas. FDI of abrupt faults with combined statistical detection and estimation and qualitative fault isolation. In Proc. 5th IFAC Symp on Fault Detection Supervision Safety Technical Processes, pages 347–352, Washington, DC, June 2003.

[Manders et al., 2000] E.-J. Manders, S. Narasimhan, G. Biswas, and P. J. Mosterman. A combined qualitative/quantitative approach for fault isolation in continuous dynamic systems. In Proc. 4th IFAC Symp on Fault Detection Supervision Safety Technical Processes, pages 1074–1079, Budapest, Hungary, June 2000.

[Mosterman and Biswas, 1999] P. J. Mosterman and G. Biswas. Diagnosis of continuous valued systems in transient operating regions. IEEE-SMCA, 29(6):554–565, 1999.

[Narasimhan and Biswas, 2006] S. Narasimhan and G. Biswas. Model-based diagnosis of hybrid systems. IEEE Transactions on Systems, Man, and Cybernetics, Part A, Sept. 2006. To appear.

[Roychoudhury et al., 2005] I. Roychoudhury, G. Biswas, X. Koutsoukos, and S. Abdelwahed. Designing distributed diagnosers for complex systems. In Proceedings of the 16th International Workshop on Principles of Diagnosis, pages 31–36, Monterey, California, June 2005.

[Russell and Norvig, 1995] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall Inc., 2nd edition, 1995.

[Figure 6 consists of three plots of flows (m^3/s) against time (seconds, from 150 to 500): (a) Flow f3, (b) Flow f5, (c) Flow f8. Each panel shows the nominal flow, the faulty flow, the estimated flow with R2 as hypothesis, and the estimated flow with R12 as hypothesis. Panel (a) also marks the instants "Fault injected", "Fault detected", "QFI complete", and "DBN based FII complete".]

Figure 6: Tracking of flow measurements for two of the four fault hypotheses, R2+ and R12+


Intermittent Fault Detection through Message Exchanges: a Coherence Based Approach

Siegfried Soldani∗,∗∗, Michel Combacau∗,∗∗∗, Jerome Thomas∗∗, Audine Subias∗,∗∗∗∗

∗LAAS-CNRS, 7 Avenue du Colonel ROCHE FRANCE-31077 Toulouse Cedex 4
∗∗ACTIA, 25 Chemin de Pouvourville - B.P. 4215 FRANCE-31432 Toulouse Cedex 4
∗∗∗Universite Paul Sabatier, 118 route de Narbonne FRANCE-31062 Toulouse Cedex 4
∗∗∗∗INSA de Toulouse, 135 avenue de Rangueil FRANCE-31077 Toulouse Cedex 4

Abstract

This paper deals with an approach for the detection and localization of intermittent faults in discrete event systems with partial observability. The proposed method is based on a discrete event model representing the normal functioning of the observable behavior of the monitored system. This model, based on the Petri net formalism, is built from the design data. The detection mechanism consists of a comparison between the flow of observable events emitted by the monitored system and the flow foreseen by the model. A localization step completes the detection mechanism and points out the set of events potentially responsible for the faults. These two mechanisms are designed to operate on-board, in real time. An example from the automotive domain is presented.

1 Introduction

In transportation systems as a whole, and particularly in the automotive field, on-board electronics are increasingly used. These technologies allow the manufacturers to reduce the fabrication costs and also to provide enhanced functionalities to the consumers.

At the same time, these evolutions lead to highly complex systems whose maintenance becomes more and more difficult.

Presently, a typical on-board control system is made up of different devices (sensors, actuators, processors) connected to each other by a network. These devices exchange data to fulfill the functions required by the optimal operation of the vehicle. Therefore, on-board fault detection and diagnosis are essential for security reasons but also for the satisfaction of the consumers.

In this paper, we propose an approach for the design of an on-board detection and diagnosis system suitable for any networked architecture. We tackle the problem of fault detection and diagnosis at a discrete event level because we are interested in the exchange of data between the different devices of the control system. Naturally, our proposition is fully compatible with local monitoring systems based on analytical or symbolic descriptions of the controlled system. Section two recalls the main works in the domain of discrete event diagnosis and positions our proposition. The main results are given in section three: constitution of the monitoring model, and detection and localization mechanisms. An application example is given in section four and the conclusion gives the most promising perspectives of this work.

2 Diagnosis and discrete events systems

Fault detection and diagnosis of discrete event systems have been the subject of many studies. Proposed approaches include Petri net approaches to fault diagnosis [Boubour et al., 1997; Hadjicostis and Verghese, ; Valette, 1995; Jiroveanu and Boel, 2005; Genc and Lafortune, ; Lefebvre and Delherm, 2005], alarm correlation approaches [Jakobson and Weissman, 1993; Nygate, 1995] and diagnoser approaches [Contant et al., 2002; Lamperti and Zanella, 2003; Pencole and Cordier, 2005; Sampath et al., 1998]. In these works the model of the system to diagnose is a behavioral model including both normal operation and faults. For instance, the diagnoser is a finite state automaton built by compilation of a model of the system, in which each state is associated with a fault hypothesis. A fault is an unobservable event and a diagnosis is a trajectory that explains the observed sequence of events. These approaches give very good results for predictable faults. Indeed, to be detected and diagnosed, a fault must be taken into account by the model of the system. This implies an accurate knowledge of the system and of the faults.

However, it is not always realistic to consider that all the faults can be anticipated. In our domain of application, the automotive field, new vehicles are designed every year using components whose behavior cannot be guaranteed whatever the conditions of use. So, it is impossible to foresee all the faults exhaustively. Moreover, in such systems, the faulty behavior can occur intermittently. It is a real problem in the automotive field because the diagnosis tools are designed to be used off-board, post-mortem and generally when the fault no longer appears. Some studies have been undertaken under these hypotheses [Contant et al., 2002]. The intermittent faults are described by the couple “occurrence event” and “reset event”.

The approach suggested here only considers a behavioral model of on-board systems in normal operation. A difference between an observed sequence of events and the normal sequences of events shown by the model is considered to be a symptom of a fault. The proposed diagnosis method consists in modifying the sequences of observed events to restore the coherency with the model. In this way, the proposed method can take into account the loss of an event or the occurrence of a spurious event, these two situations being the typical symptoms of an intermittent fault. We must note that, if an intermittent fault does not result in a spurious event or a missing event, it will not be detected.

3 DESCRIPTION OF THE PRINCIPLE

Our proposition consists of three steps: the modeling of system behavior (off-line), the detection method of the intermittent and fugitive faults (on-line), and a first reasoning of localization (on-line).

3.1 Modeling of system behavior in normal operation

Let us recall that we consider an architecture based on electronic devices connected by a network. This set constitutes a control system providing functions or services to the user. Whatever the application domain, the behavior of these functions is modeled in normal operation. We consider that the evolutions of the function behavior can only be observed through the messages flowing between the devices on the communication network. Thus, the observable events are the communication events.

The model of function behavior is determined from the design data of manufacturers. These data are modified in order to build a model which describes the different successive states in which the function is. Petri nets are used to represent the sequential behavior. Petri nets are chosen because they describe synchronisation and concurrency better than the other finite state models. Moreover, their formalism is well suited to describe the detection and diagnosis mechanisms. In a Petri net, the places represent the states of the function and the transitions describe the conditions which involve a change of state and the actions which result from it.

Our model must evolve at every occurrence of an observable event. That is why an abstraction of the model is made to get a model in which only the transitions associated with the observable events are represented. The principle of this abstraction is to merge the places related by unobservable events. Indeed, according to the set of observable events, these states cannot be distinguished. That presents a problem of diagnosability of the system. Indeed, if a large set of states is merged in the model, it becomes difficult to discriminate them on-line. It seems that the question of diagnosability is related to the structure of the model of the function. This aspect is presently under study but is outside the scope of this paper.

By this abstraction, only a subset of system states may be discriminated and this imprecision corresponds to the partial view of the function evolution through its observable events.
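The merging principle can be sketched with a union-find structure. This is only an illustration under simplifying assumptions: each transition is taken to have exactly one input and one output place (a state-machine-like net), and the function name `abstract_model` and the triple encoding `(source, event, target)` are hypothetical choices for the example.

```python
def abstract_model(places, transitions, observable):
    """Merge places linked by unobservable events.

    places: iterable of place names.
    transitions: iterable of (source_place, event, target_place) triples.
    observable: set of observable event names.
    Returns (mapping place -> representative, set of abstract transitions).
    """
    parent = {p: p for p in places}

    def find(p):
        # Follow parent links to the class representative (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Places joined by an unobservable event cannot be told apart on-line.
    for src, event, dst in transitions:
        if event not in observable:
            ra, rb = find(src), find(dst)
            parent[max(ra, rb)] = min(ra, rb)

    mapping = {p: find(p) for p in places}
    abstract = {
        (mapping[s], e, mapping[d]) for s, e, d in transitions if e in observable
    }
    return mapping, abstract
```

For instance, with transitions a -e1-> b, b -u-> c, c -e2-> d and only e1 and e2 observable, places b and c collapse into one abstract place, leaving the two observable transitions a -e1-> b and b -e2-> d.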

3.2 Detection of intermittent and fugitive faults

The evolution of the system state is followed by the evolution of the Petri net marking at each occurrence of an observable event. The initial state is assumed known. For each marking, a set of expected events (the events of the enabled transitions) is determined on the Petri net. This set exhibits the normal evolutions of the system. When an observable event occurs, either it belongs to the set of expected events and the Petri net marking is updated (normal evolution), or it does not belong to this set and an inconsistency is detected (symptom of a fault).

Formally, this detection mechanism can be described as follows. First, let us recall the definition of a Petri net. A marked Petri net (P-net) is a 5-tuple R = <P, T, I, O, M0>, where P represents the set of places and T the set of transitions. M0 is the initial marking of places, M0 : P → N, where N is the set of natural numbers and the value M0(p) is the number of tokens in place p. I is the Input function, I : P × T → N, where the value I(p,t) is the weight of the arc from the place p to the transition t. O is the Output function, O : T × P → N, where the value O(t,p) is the weight of the arc from the transition t to the place p. A transition ti can be fired for a marking Mi iff Mi ≥ I(., ti), where I(., ti) is the ith column of the matrix representing the Input function. In the sequel, we assume that a Petri net describes the observable evolutions of a given function distributed on the devices of the network.

Let E be the set of observable events:

\[ E = \{e_1, e_2, \ldots, e_n\} \]

P(E) denotes the set of subsets of E. We define the application Et, which associates a set of events with a transition, by:

\[ E_t : T \to P(E), \qquad t_i \mapsto E_t(t_i) \subseteq E \]

P(T) denotes the set of subsets of T. The application ETFT, which associates with a marking the set of its fireable transitions, is such that:

\[ E_{TFT} : M \to P(T), \qquad M_i \mapsto E_{TFT}(M_i) = \{ t_i \mid M_i \ge I(., t_i) \} \]

The set Eexp(Mi) of events associated with the fireable transitions from a marking, i.e., the expected events, is defined as:

\[ E_{exp}(M_i) = \bigcup_{t_i \in E_{TFT}(M_i)} E_t(t_i) \]

The application ETFE, which associates a set of one or several transitions with an event, is given by:

\[ E_{TFE} : E \to P(T), \qquad e_i \mapsto \{ t_i \mid e_i \in E_t(t_i) \} \]

Given a marking Mi and the occurrence of an event ej, the set EF(Mi, ej) of transitions that can be fired is:

\[ E_F(M_i, e_j) = E_{TFE}(e_j) \cap E_{TFT}(M_i) \]

If EF(Mi, ej) ≠ ∅, then a new marking M′ is reached (with tk ∈ EF(Mi, ej) the transition fired):

\[ M' = M_i - I(., t_k) + O(., t_k) \]

If EF(Mi, ej) = ∅, there is an inconsistency between the incoming event ej and the set of expected events. This is the mechanism of symptom detection.

This incoherence may have different causes:

• The received event occurred spuriously,

• One or more events did not occur,

• An arc is missing in the Petri net, i.e., there is a modeling problem. In our work, we do not consider this situation.

[Figure 1 shows a Petri net with places P1, P2, P3, P4 and transitions t1, t2, t3.]

Figure 1: Different cases of incoherence in a Petri net

For example (Figure 1), the initial marking is M0 = [1, 0, 0, 0]^T. The events associated with the transitions are Et(t1) = {e1}, Et(t2) = {e2}, Et(t3) = {e3}, and the transitions associated with the events are ETFE(e1) = {t1}, ETFE(e2) = {t2}, ETFE(e3) = {t3}. The set of fireable transitions is ETFT(M0) = {ti | M0 ≥ I(., ti)} = {t1}, so the expected events are Eexp(M0) = {e1}. If the first incoming event is e1, then the set of transitions associated with e1 is ETFE(e1) = {t1}. Thus ETFT(M0) ∩ ETFE(e1) = {t1} ≠ ∅. There is no incoherence. The new marking is M1 = [0, 1, 0, 0]^T. In the same way, the new fireable transition and the new expected event are ETFT(M1) = {t2} and Eexp(M1) = {e2}, respectively. Let us suppose that the second incoming event is e3. Then the set of transitions associated with e3 is ETFE(e3) = {t3}. Thus ETFT(M1) ∩ ETFE(e3) = ∅. There is an inconsistency and a symptom of fault is detected.

Thus, the appearance of an incoherence corresponds to a fault detection in the system. This fault can be a permanent fault, or a fugitive or intermittent one. In the case of an intermittent or fugitive fault, the detection brings no information other than the fact that the fault occurred. Therefore, as much relevant information as possible must be saved in a report so that the system can be diagnosed later with the off-board diagnosis tool. In addition to this information, a first reasoning step can be carried out to attempt a first localization of the fault.
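Returning to the example of Figure 1, the detection mechanism can be sketched in Python. The net encoded below is an assumption: the figure is taken to be a simple chain P1 → t1 → P2 → t2 → P3 → t3 → P4 with unit arc weights, and the function names `fireable`, `transitions_for`, `expected_events` and `observe` are hypothetical stand-ins for ETFT, ETFE, Eexp and the marking update.

```python
# Assumed chain net for Figure 1 (unit weights): P1 -t1-> P2 -t2-> P3 -t3-> P4.
PRE = {"t1": {"P1": 1}, "t2": {"P2": 1}, "t3": {"P3": 1}}    # I(., t)
POST = {"t1": {"P2": 1}, "t2": {"P3": 1}, "t3": {"P4": 1}}   # O(., t)
EVENTS = {"t1": {"e1"}, "t2": {"e2"}, "t3": {"e3"}}          # Et(t)


def fireable(marking):
    """ETFT(M): transitions whose input places hold enough tokens."""
    return {t for t, pre in PRE.items()
            if all(marking.get(p, 0) >= w for p, w in pre.items())}


def transitions_for(event):
    """ETFE(e): transitions labeled with the given event."""
    return {t for t, events in EVENTS.items() if event in events}


def expected_events(marking):
    """Eexp(M): events attached to the fireable transitions."""
    return set().union(*(EVENTS[t] for t in fireable(marking)))


def observe(marking, event):
    """Return the new marking, or None if an inconsistency is detected."""
    candidates = sorted(fireable(marking) & transitions_for(event))  # EF(M, e)
    if not candidates:
        return None                      # symptom of a fault
    t = candidates[0]                    # assumption: take one if several
    new = dict(marking)
    for p, w in PRE[t].items():
        new[p] = new.get(p, 0) - w
    for p, w in POST[t].items():
        new[p] = new.get(p, 0) + w
    return new
```

Starting from M0 = {P1: 1}, observing e1 moves the token to P2, after which observing e3 returns `None`, reproducing the inconsistency of the example.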

3.3 Localization reasoning

Let Sc = (e1, e2, ..., ed−1, ed) be the sequence of the last observable events leading to an inconsistency when ed occurs. We work under the hypothesis that a fault consists of a spurious event or of a missing event. The remainder of the fault processing consists in modifying Sc by deleting one of its events or by inserting an event of E within Sc in order to restore the consistency between the observations and the model state trajectories. We define the localization step as follows. Given an event ed not consistent with the model trajectories (EF(Mi, ed) = ∅), we denote by S = (e1, e2, ..., ed−1) the sequence of events such that

\[ M_p \xrightarrow{\;S\;} M_i \]

and Sc = (S, ed).

[Figure 2 shows a Petri net with places P0 to P5 whose transitions are labeled with the events e1 to e5.]

Figure 2: Localization reasoning in a Petri net

Algorithm 1. Deleting Mechanism
For each ei ∈ Sc,

1. Build S^-_{ci} = (e1, ..., ei−1, ei+1, ..., ed).

2. If S^-_{ci} is fireable from the marking Mp, then ei is pointed out as a possible localization of the fault.

The algorithm complexity is bounded by dim(Sc).

Algorithm 2. Insertion Mechanism
For each ei ∈ Sc, for each ej ∈ E,

1. Build S^+_{cij} = (e1, ..., ei−1, ej, ei, ..., ed).

2. If S^+_{cij} is fireable from the marking Mp, then ej is pointed out as a possible localization of the fault.

The algorithm complexity is bounded by dim(Sc) × dim(E).

Remark: the sequence S always points out the event ed as a possible localization of the fault.

The result of the localization step is the set of events that have been pointed out. The length of the sequence Sc to consider could be given by the control structure of the Petri net model. Indeed, Sc must be such that the marking Mp can be considered as certain, i.e., no fault has occurred before it was reached in the model. This aspect of the method and the complexity reduction are currently under study and are not presented in this paper.

The example in Figure 2 illustrates this principle. From the initial state, Sc = (e2, e3, e4) leads to the detection and two localizations are possible:

• by deleting e4, S^-_{c4} = (e2, e3) is effectively a model trajectory;

• by inserting e1 at the beginning of the sequence, S^+_{c21} = (e1, e2, e3, e4) is a model trajectory, too.

The on-board localization of the faulty event is the first step of the fault localization. Off-board, these data will be used to localize the components responsible for the detected fault (for example, the component which sent the faulty event).
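The two localization mechanisms can be sketched as follows. The net used here is hypothetical and is NOT the exact net of Figure 2: it is a single-token model, encoded as arcs (place, event) → place, chosen only so that (e2, e3) and (e1, e2, e3, e4) are model trajectories while (e2, e3, e4) is not. The function names are likewise assumptions.

```python
# Hypothetical single-token net, not the exact net of Figure 2.
ARCS = {
    ("P0", "e2"): "P1", ("P1", "e3"): "P2",
    ("P0", "e1"): "P3", ("P3", "e2"): "P4",
    ("P4", "e3"): "P5", ("P5", "e4"): "P0",
}
EVENTS = ["e1", "e2", "e3", "e4"]


def is_fireable(sequence, start="P0"):
    """True if the event sequence is a model trajectory from `start`."""
    place = start
    for event in sequence:
        place = ARCS.get((place, event))
        if place is None:
            return False
    return True


def deletion_candidates(sc, start="P0"):
    """Algorithm 1: events whose deletion restores consistency."""
    sc = list(sc)
    return [sc[i] for i in range(len(sc))
            if is_fireable(sc[:i] + sc[i + 1:], start)]


def insertion_candidates(sc, start="P0"):
    """Algorithm 2: (position, event) insertions that restore consistency."""
    sc = list(sc)
    return [(i, e) for i in range(len(sc)) for e in EVENTS
            if is_fireable(sc[:i] + [e] + sc[i:], start)]
```

On this net, the inconsistent sequence (e2, e3, e4) yields e4 as the only deletion candidate and e1, inserted at the beginning, as the only insertion candidate, mirroring the two localizations of the example. The two loops also make the stated complexity bounds visible: dim(Sc) trial sequences for deletion and dim(Sc) × dim(E) for insertion.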


4 APPLICATION IN AUTOMOTIVE INDUSTRY

4.1 Specificities of the automotive context

Nowadays, new vehicles are brought out every year. The number of security systems, aiding systems and options for the driver's comfort has drastically increased in recent years. This evolution is due to the use of electronic devices connected by one or several local area networks. In most architectures, the devices, named Electronic Control Units (ECU), are connected by a specific local network: the Controller Area Network (CAN).

Controller Area Network is based on a serial communication protocol, which supports distributed real-time control. In present (and future) cars, the functions provided to the passengers are implemented by a collaboration of the different ECU. Our example concerns the “front wiping” function (see Figure 3). It is a very simple function in the car and the model of this function is not of a high level of complexity, but it is sufficient to show how our proposal works.

[Figure 3 shows the Actuators ECU, the ON/OFF and speed selection ECU, the Central ECU and the Monitoring System connected to the CAN.]

Figure 3: CAN Network and Monitoring System

This function is distributed on three different ECU: the “on/off and speed selection” ECU, the actuators ECU and the central ECU (main processor of the vehicle). The central ECU manages the behavior of the function. It receives messages from the two other ECU and sends control messages to the actuators. For monitoring purposes, each ECU has an internal memory in which all the foreseen faults are registered (for instance, the value of a variable going out of its range). This fault registering is used off-line by the mechanics to diagnose the vehicle's state. However, for unforeseen faults nothing can be registered, and for intermittent faults, the ECU indicates faults that are no longer present.

Our works are intended to give a solution to this kindof faults. In this architecture, the different ECU are“black boxes” and the real time monitoring can onlybe done through the observation of the messages ex-changed on the network. Let us see on this examplehow the detection and localization mechanisms previ-ously described can be great advantage.

Figure 4: Abstracted model of the function “front wiping” (Petri net with places p0 to p4 and transitions t0 to t11, labeled by the events and event sets e1, e2, e3, e4, e5, e6, e1.e4, e1.e6, e2.e3, e3.e5, e3.e6 and e5.e4)

4.2 Modeling of a distributed function

The Petri net model of the function is extracted from the design data (in our case, the design data are given in the StateChart formalism [Harel, 1987]). The StateChart shows the set of function states, the conditions of the state transitions and the actions associated with transitions and/or states. Both observable events (the messages exchanged on the CAN network) and unobservable events (events internal to an ECU) appear in the model derived from the StateChart specification.

By merging places linked by unobservable events, we obtain the monitoring model representing the exchange of messages between the three ECUs during normal operation of the front-wiping function.

The monitoring model is described in Figure 4. This function has five states:

p0 : initial state reached each time the power is switched off,

p1 : stand-by (wiping suspended or stopped),

p2 : wiping low-speed,

p3 : maintenance,

p4 : wiping high speed.

The transitions are labeled by sets of events whose individual meanings are as follows:

e1 : maintenance request off,

e2 : maintenance request on,

e3 : low speed request off,

e4 : low speed request on,


e5 : high speed request off,

e6 : high speed request on.

In this model, different states of the initial function model have been merged because there are no observable events to discriminate them (for instance, place $p_1$ hides five places of the initial model).

4.3 Detection of a fault

Let us suppose that the function is in a state represented by the abstraction $M(p_1) = 1$ in the model. The set of expected events is $E_{exp}(M(p_1) = 1) = \{e_2, e_4, e_6\}$ and the set of fireable transitions is $E_{TFT}(M(p_1) = 1) = \{t_2, t_3, t_6\}$. The occurrence of event $e_2$, with $E_{TFE}(e_2) = \{t_2, t_{10}\}$, leads to the firing of the transition of the set

$$E_F(M(p_1) = 1, e_2) = E_{TFE}(e_2) \cap E_{TFT}(M(p_1) = 1) = \{t_2\}$$

and so to the new marking characterized by $M(p_3) = 1$. There is no inconsistency between the observed event and the trajectory in the model.

Now let us suppose that, from marking $M(p_3) = 1$, the event $e_3$ occurs. The set of expected events is $E_{exp}(M(p_3) = 1) = \{e_1, e_1.e_4, e_1.e_6\}$, so the set of transitions to fire is empty:

$$E_F(M(p_3) = 1, e_3) = E_{TFE}(e_3) \cap E_{TFT}(M(p_3) = 1) = \{t_0, t_4, t_7, t_{10}\} \cap \{t_1, t_9, t_{11}\} = \emptyset$$

In this case a symptom of fault is detected because the event $e_3$ does not match any fireable transition of the model. The next step consists in localizing the possible faults.
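The detection test above amounts to one set intersection per observed event. The following is a minimal sketch of that check, not the authors' implementation: it encodes only the four sets quoted in this section, and the dictionary and function names are hypothetical.

```python
# Illustrative sketch: only the sets quoted in Section 4.3 are encoded.
# Transitions fireable from each marking (E_TFT); hypothetical name.
FIREABLE_FROM = {
    "p1": {"t2", "t3", "t6"},
    "p3": {"t1", "t9", "t11"},
}
# Transitions labeled by each observed event (E_TFE); hypothetical name.
LABELED_BY = {
    "e2": {"t2", "t10"},
    "e3": {"t0", "t4", "t7", "t10"},
}


def transitions_to_fire(place, event):
    """E_F(M, e) = E_TFE(e) ∩ E_TFT(M) for a marking with one token in `place`."""
    return LABELED_BY[event] & FIREABLE_FROM[place]


def is_symptom(place, event):
    """A fault symptom is raised when no fireable transition matches the event."""
    return not transitions_to_fire(place, event)


print(transitions_to_fire("p1", "e2"))  # {'t2'}
print(is_symptom("p3", "e3"))           # True
```

A complete monitor would also update the marking after each consistent event, which requires the full Petri net of Figure 4 and is left out here.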

4.4 Fault localization

The goal of the on-board monitoring system is to provide the mechanics with all the information that can be useful for the diagnosis. The main information to be collected comes from the localization step, which gives the possible scenarios leading to the detection of the symptom. According to the adopted notation, the localization step begins with the following data:

$$S_c = (e_2, e_3)$$

$$M_p = (M(p_1) = 1), \qquad M_i = (M(p_3) = 1)$$

$$E = \{e_1, e_2, e_3, e_4, e_5, e_6, e_1.e_6, e_1.e_4, e_2.e_3, e_3.e_5, e_3.e_6, e_4.e_5\}$$

By deleting one event of the sequence $S_c$ we get the two sequences $S^-_{c2} = (e_3)$ and $S^-_{c3} = (e_2)$. From marking $M_p$ ($M(p_1) = 1$) only $S^-_{c3}$ is a fireable sequence. This points out the event $e_3$ as a possible fault localization.

By adding to the sequence an event of the set $E$, we generate 24 sequences among which only $S^+_{c3(1.4)} = (e_2, e_1.e_4, e_3)$ matches a possible firing sequence from $M(p_1) = 1$ in the Petri net. This points out that the lack of the event $e_1.e_4$ in the observed sequence is a possible localization of the fault. Finally, two possible localizations are stored in the on-board monitoring system:

$M_p : M(p_1) = 1$, $S^-_{c3} = (e_2)$: spurious event $e_3$

$M_p : M(p_1) = 1$, $S^+_{c3(1.4)} = (e_2, e_1.e_4, e_3)$: lacking event $e_1.e_4$
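The generation of candidate scenarios described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' tool: the function names are hypothetical, events are inserted only before an observed event (an assumption that reproduces the count of 24 for 12 events and 2 positions), and `is_fireable` is a stand-in for replaying a sequence on the Petri net of Figure 4 from $M(p_1) = 1$, which is not reproduced here.

```python
# Event set E of Section 4.4.
EVENTS = ["e1", "e2", "e3", "e4", "e5", "e6",
          "e1.e6", "e1.e4", "e2.e3", "e3.e5", "e3.e6", "e4.e5"]


def deletion_candidates(observed):
    """Yield (deleted_event, shortened_sequence) for every single deletion."""
    for k, event in enumerate(observed):
        yield event, observed[:k] + observed[k + 1:]


def insertion_candidates(observed, events):
    """Yield (added_event, extended_sequence), inserting before each observed event.

    Assumption: nothing is inserted after the last event, since that event
    is the one that raised the symptom.
    """
    for k in range(len(observed)):
        for event in events:
            yield event, observed[:k] + (event,) + observed[k:]


def localize(observed, events, is_fireable):
    """Return the spurious-event and lacking-event hypotheses."""
    spurious = [ev for ev, seq in deletion_candidates(observed) if is_fireable(seq)]
    lacking = [ev for ev, seq in insertion_candidates(observed, events)
               if is_fireable(seq)]
    return spurious, lacking


# Stand-in for the Petri net replay: the two sequences that the text
# reports as fireable from M(p1) = 1.
KNOWN_FIREABLE = {("e2",), ("e2", "e1.e4", "e3")}


def is_fireable(sequence):
    return sequence in KNOWN_FIREABLE


observed = ("e2", "e3")
print(len(list(insertion_candidates(observed, EVENTS))))  # 24
print(localize(observed, EVENTS, is_fireable))            # (['e3'], ['e1.e4'])
```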

Therefore, this first approach tries to bring a first explanation of what happens in the system, and to improve the knowledge about the context in which the fault appears. These events are sent, or not sent, by an ECU. This ECU and all the components connected to it (actuators, sensors, ...) are suspected.

The garage diagnosis tool will take advantage of this information by limiting the diagnosis to the suspected elements.

5 CONCLUSION

The aim of this paper is to describe a method for the detection of intermittent faults by an on-board monitoring system based on discrete event models. The monitoring system has a partial view of the events emitted by the monitored system and operates with a model of normal functioning only.

The considered approach is of particular interest for automotive applications, a domain in which intermittent faults lead to very awkward situations for the mechanics. The localization of the fault, i.e. the lacking event or the spurious event, gives potentially useful information to the mechanics (the components at the origin of the fault).

Technically, this on-board monitoring system is designed to be connected to the off-board diagnosis tool used by the mechanics in the garage.

Future work will focus on the principles of localization in order to determine the last certain marking ($M_p$ in the paper). A real application on a vehicle will be developed to prove the efficiency of the proposition.

References

[Boubour et al., 1997] R. Boubour, C. Jard, A. Aghasaryan, E. Fabre, and A. Benveniste. A Petri net approach to fault detection and diagnosis in distributed systems. Part I: Application to telecommunication networks, motivations and modeling. In Proceedings of the 36th IEEE Conference on Decision and Control, pages 720–725, San Diego (USA), December 1997.

[Contant et al., 2002] O. Contant, S. Lafortune, and D. Teneketzis. Failure diagnosis of discrete event systems: The case of intermittent faults. In Proceedings of the 41st IEEE Conference on Decision and Control, pages 4006–4011, Las Vegas (USA), December 2002.

[Genc and Lafortune,] S. Genc and S. Lafortune. Distributed diagnosis of discrete-event systems using Petri nets. In ICATPN’03.

[Hadjicostis and Verghese,] C.N. Hadjicostis and G.C. Verghese. Monitoring discrete event systems using Petri net embeddings. In ICATPN’99.

[Harel, 1987] D. Harel. Statecharts: A visual formalism for complex systems. Science of Computer Programming, 8:231–274, 1987.

[Jakobson and Weissman, 1993] G. Jakobson and M.D. Weissman. Alarm correlation. IEEE Network, 7(6):52–59, 1993.

[Jiroveanu and Boel, 2005] G. Jiroveanu and R.K. Boel. Petri net model-based distributed diagnosis for large interacting systems. In Proceedings of the 16th International Workshop on Principles of Diagnosis, DX’05, Monterey, California (USA), June 2005.

[Lamperti and Zanella, 2003] G. Lamperti and M. Zanella. Continuous diagnosis of discrete-event systems. In Proceedings of the 14th International Workshop on Principles of Diagnosis, DX’03, pages 105–112, Washington D.C. (USA), 2003.

[Lefebvre and Delherm, 2005] D. Lefebvre and C. Delherm. Diagnosis with causality relationships and directed paths in Petri net models. In IFAC World Congress’05, Prague (Czech Republic), July 2005.

[Nygate, 1995] Y. A. Nygate. Event correlation using rule and object based techniques. In Proceedings of the 4th Symposium on Integrated Network Management, pages 290–301, Santa Barbara (USA), 1995.

[Pencole and Cordier, 2005] Y. Pencole and M.-O. Cordier. A formal framework for the decentralised diagnosis of large scale discrete event systems and its application to telecommunication networks. Artificial Intelligence, 164(1–2):121–170, 2005.

[Sampath et al., 1998] M. Sampath, S. Lafortune, and D. Teneketzis. Active diagnosis of discrete-event systems. IEEE Transactions on Automatic Control, 43(7):908–929, 1998.

[Valette, 1995] R. Valette. Petri nets for control and monitoring: Specification, verification and implementation. In Workshop on Analysis and Design of Event-Driven Operations in Process Systems (ADEDOPS), Imperial College, London (GB), 1995.


Distributed Trace Estimation with Asynchronous Local Clocks and Imperfect Observation Channels

Rong Su and Michel Chaudron
System Architecture and Networking Group (SAN)
Department of Mathematics and Computer Science, Eindhoven University of Technology
PO Box 513, 5600 MB Eindhoven, The Netherlands. Email: r.su,[email protected]

Abstract

In conservative (or complete) trace-based distributed fault diagnosis it is common to compute, for each component, a preliminary local estimate consisting of all possible traces that can generate the same local symptom (i.e. a sequence of observable events) as the one being received from that component up to a specific time instant. Then communication among local trace estimators can be used to refine those preliminary local estimates. One approach has been proposed by Su and Wonham to perform such refinement in terms of achieving an appropriate consistency among local estimates. But implementation of that approach requires synchronization of local clocks within local estimators and flawless observation channels, in the sense that observable events arrive without delay or loss. In this paper we modify that approach so that its implementation is independent of those requirements, at the expense of a potential degradation of the quality of estimation.

1 Introduction

Timely and effective fault diagnosis is important to maintain the performance of an industrial system. There has been a large volume of research on this topic for applications that can be modeled as discrete-event dynamic systems, e.g. centralized approaches [6] [5] [11], decentralized approaches [1] [7] and distributed approaches [2] [3] [8]. It is known that centralized approaches usually provide the best diagnostic quality in terms of the number of fault candidates that can explain a target system's abnormal behavior. Nevertheless, they suffer from modeling, computational and implementation difficulties such as high space complexity and weak scalability/robustness. High space complexity is also common in decentralized approaches, typically during the diagnoser synthesis stage, when a model of the entire system is required. To overcome those difficulties, especially the space complexity, in distributed approaches, e.g. [3] [8], the target system is modeled as a collection of local components, each of which is monitored by a dedicated local diagnoser. The task of a local diagnoser is first to compute a preliminary local estimate (of either states or traces) of the corresponding local component, then to communicate with other local diagnosers to refine the local estimate. [3] provides a good review of Petri-net based distributed diagnosis, while a language-based distributed framework and its connection with related work in computer science, e.g., [17] [15] [14] [12] [13] [16], is summarized in [8]. In [8] each preliminary local estimate is modeled as a collection of traces, and trace estimation is done by a local estimator within the corresponding local diagnoser. The refinement is instantiated as a process of achieving appropriate consistency among local estimates. Some effective algorithms are provided in [8]. Once each local diagnoser has an estimate of the target component's trace, it can derive the fault status of the component accordingly.

It turns out that, to implement the distributed trace estimation approach proposed in [8], synchronization of the local clocks of the local estimators is necessary in order to avoid inconsistency of local estimates owing to time mismatch of local observations; furthermore, whenever there is an observation from a component to its dedicated local estimator, its arrival should not be delayed or missing. Clearly, these requirements, namely synchronization of local clocks and flawless observation channels, are too strong to be guaranteed during implementation, especially in a distributed network. Therefore, in this paper we modify the approach proposed in [8] so that the implementation is independent of those requirements. More explicitly, we use postfix-closed sublanguages during the computation of preliminary local estimates to overcome the potential time mismatch among local clocks, and revise projection maps to handle potentially distorted or missing observations. The consequence is that each resulting local trace estimator can be switched on and off freely, so flexibility of implementation is achieved. Of course, there is no free lunch: the independence of the proposed approach from perfect timing and accurate observations leads to a degraded quality of estimation, which provides a vivid illustration of the relationship between what we can observe and what we can infer.

This paper is organized as follows. In Section 2 we give a brief introduction to the basic concepts of traces, languages, clocks and mappings such as projections and synchronous product. Then in Section 3 we review the approach proposed in [8] under the assumption that local clocks are synchronized and local observation channels are flawless. After that, we first remove the assumption of synchronization of local clocks and provide a distributed trace estimation procedure in Section 4, then we propose in Section 5 an approach to handle situations where local observation channels may be distorted. Conclusions are drawn in Section 6.

2 Concepts of clocks and time-stamped traces

2.1 Languages and related mappings

In this paper we follow the notation rules of [10]. Let $\Sigma$ be an alphabet, where each element in $\Sigma$ is called an event. A trace (or string) over $\Sigma$ is a finite sequence of events taken from $\Sigma$. Let $\Sigma^+$ be the set of all possible strings over $\Sigma$. Let $\varepsilon$ represent a special string, the empty string, which is not contained in $\Sigma^+$. Let $\Sigma^* := \Sigma^+ \cup \{\varepsilon\}$. A language is a subset of $\Sigma^*$. A string $s_1 \in \Sigma^*$ is said to be a prefix substring of another string $s_2 \in \Sigma^*$, written $s_1 \le s_2$, if there exists $s_3 \in \Sigma^*$ such that $s_2 = s_1 s_3$. Let $L \subseteq \Sigma^*$. We say $\overline{L} \subseteq \Sigma^*$ is the prefix closure of $L$ if

$$\overline{L} = \{s \in \Sigma^* \mid (\exists s' \in L)\ s \le s'\}$$

$L$ is said to be prefix closed if $L = \overline{L}$. Given two languages $A, B \subseteq \Sigma^*$ let $AB = \{ab \mid a \in A \ \&\ b \in B\}$ denote the set of strings generated by concatenation. If $A$ is a singleton, say $\{a\}$, then we write $aB$ to mean $\{a\}B$.

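Let $\Sigma' \subseteq \Sigma$. The natural projection $P : \Sigma^* \to \Sigma'^*$ is defined as follows:

1. $P(\varepsilon) = \varepsilon$

2. $(\forall \sigma \in \Sigma)\ P(\sigma) = \begin{cases} \sigma & \text{if } \sigma \in \Sigma' \\ \varepsilon & \text{if } \sigma \notin \Sigma' \end{cases}$

3. $(\forall s\sigma \in \Sigma^*)\ P(s\sigma) = P(s)P(\sigma)$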

The inverse image function of $P$ is $P^{-1}$, defined by

$$P^{-1} : \mathrm{Pwr}(\Sigma'^*) \to \mathrm{Pwr}(\Sigma^*) : U \mapsto P^{-1}(U) := \{s \in \Sigma^* \mid P(s) \in U\}$$

If $U = \{s\}$, a singleton, we write $P^{-1}(s)$ for $P^{-1}(\{s\})$.
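The two maps just defined are easy to experiment with on small alphabets. The sketch below is only an illustration: it assumes single-character events so that a trace is a Python string, the function names are hypothetical, and since $P^{-1}(U)$ is in general infinite the inverse image is enumerated only up to a chosen length.

```python
from itertools import product


def project(word, sub_alphabet):
    """Natural projection P: erase every event that is not in sub_alphabet."""
    return "".join(event for event in word if event in sub_alphabet)


def inverse_image(targets, alphabet, sub_alphabet, max_len):
    """Strings over `alphabet` of length <= max_len whose projection lies in `targets`.

    This is a bounded fragment of P^{-1}(targets), not the full (infinite) set.
    """
    found = set()
    for length in range(max_len + 1):
        for letters in product(sorted(alphabet), repeat=length):
            word = "".join(letters)
            if project(word, sub_alphabet) in targets:
                found.add(word)
    return found


print(project("acbcb", {"a", "b"}))                         # abb
print(sorted(inverse_image({"b"}, {"a", "b"}, {"b"}, 2)))   # ['ab', 'b', 'ba']
```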

Given alphabets $\Sigma_1$ and $\Sigma_2$, let $P_1 : (\Sigma_1 \cup \Sigma_2)^* \to \Sigma_1^*$ and $P_2 : (\Sigma_1 \cup \Sigma_2)^* \to \Sigma_2^*$ be two natural projections. The synchronous product of two languages $L_1 \subseteq \Sigma_1^*$ and $L_2 \subseteq \Sigma_2^*$ is defined as

$$L_1 \| L_2 = P_1^{-1}(L_1) \cap P_2^{-1}(L_2)$$

It is shown in [10] that the synchronous product is associative and commutative.

The following notation involving natural projections will be used freely. Let $I$ be a finite index set. Unless specified otherwise, $I = \{1, 2, \cdots, n\} \subset \mathbb{N}$. Let $\{\Sigma_i \mid i \in I\}$ be a family of alphabets.

1. For $J \subseteq I$ write $\Sigma_J := \cup_{j \in J} \Sigma_j$.

2. For $J, K \subseteq I$ let $P_{J,K} : \Sigma_J^* \to (\Sigma_J \cap \Sigma_K)^*$ be the natural projection from (event set) $\Sigma_J$ to $\Sigma_K$. If $J$ (or $K$) is a singleton set, say $J = \{j\}$ (or $K = \{k\}$), then we simply use $j$ (or $k$) to denote $J$ (or $K$) in the corresponding notation of natural projection.

3. For $\Sigma'' \subseteq \Sigma' \subseteq \Sigma_I$ let $P_{\Sigma',\Sigma''} : \Sigma'^* \to \Sigma''^*$ be the natural projection, if no other rule applies.

2.2 Clocks and time-stamped traces

Definition 2.1 Let $\{\Sigma_i \mid i \in I\}$ be a family of alphabets. A distributed model $\mathcal{L}$ is a set of prefix closed languages $\mathcal{L} := \{L_i \subseteq \Sigma_i^* \mid i \in I\}$, where $L_i$ is a local component of $\mathcal{L}$. There is a subset $\Sigma_{io} \subseteq \Sigma_i$ called the observable event set of $L_i$, which is not necessarily disjoint from the other observable event sets.

The basic setup for each local component $L_i$ ($i \in I$) is similar to the one used in [6]. $L_i$ represents the transition behavior of a local component. The observable event set $\Sigma_{io}$ contains all events (or actions) that can be observed in component $L_i$. A local symptom of a component $L_i$ is simply an observable string in $\Sigma_{io}^*$. A distributed trace estimation process aims to compute, for each local component, a collection of traces that can generate the same local symptom as the one collected up to a specific time instant. Such traces can be used to derive more application-oriented information, e.g. the fault status of a component, which is briefly described as follows.

As originated from [6] [5], each local alphabet $\Sigma_i$ contains a subset $\Sigma_{if} \subseteq \Sigma_i$, where each element $\sigma \in \Sigma_{if}$ is called a fault event denoting a single fault, e.g., $\sigma$ may stand for a valve being stuck open, or a router being down, etc. A subset $U \subseteq \Sigma_{if}$ is called a compound fault, which is a collection of single faults. Usually we assume that faults are unobservable to make the fault diagnosis problem nontrivial. A component is faulty if its true trace contains at least one fault event. Thus, the objective of fault diagnosis is first to decide whether a component's trace contains any fault event (i.e., fault detection), and then to determine which fault events are contained (i.e., fault identification). To that end, for $L_i$ we define a local fault report map

$$R_i : \Sigma_i^* \to 2^{\Sigma_{if}} : s \mapsto R_i(s) := \{\sigma \in \Sigma_{if} \mid \sigma \in s\}$$

where $\sigma \in s$ denotes that the event $\sigma$ appears in the string $s$ at least once. The set $R_i(s)$ is a compound fault containing all single faults that can occur if $L_i$ executes $s$. Suppose we obtain a local estimate $E_i(t)$, which contains all possible traces of $L_i$ with respect to a specific local symptom up to the time instant $t$. Then the set $D_i(t) := \{R_i(s) \mid s \in E_i(t)\}$ is called the local diagnosis of $L_i$ based on $E_i(t)$. From $D_i(t)$ we can infer that: (1) each single fault in $\bigcap_{s \in E_i(t)} R_i(s)$ must have occurred because it is contained in every possible trace; (2) each single fault in $\bigcup_{s \in E_i(t)} R_i(s) - \bigcap_{s \in E_i(t)} R_i(s)$ may or may not have occurred. Thus, we think that the essential step in fault diagnosis is to obtain a “good” local estimate for each local component, which is the main objective of this paper.
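As a small illustration of how a local diagnosis follows from a local estimate, the sketch below computes $R_i(s)$, $D_i(t)$ and the two inferred fault sets. The fault alphabet `{"f", "g"}` and the estimate are hypothetical values chosen for the example, events are assumed to be single characters, and the function names are not taken from the paper.

```python
def fault_report(trace, fault_events):
    """R_i(s): the set of fault events occurring at least once in the trace."""
    return frozenset(event for event in trace if event in fault_events)


def local_diagnosis(estimate, fault_events):
    """Return D_i(t), the faults that must have occurred, and those that may have."""
    reports = [fault_report(trace, fault_events) for trace in estimate]
    if not reports:
        return set(), frozenset(), frozenset()
    certain = frozenset.intersection(*reports)       # in every possible trace
    possible = frozenset().union(*reports) - certain  # in some traces only
    return set(reports), certain, possible


# Hypothetical local estimate: "f" and "g" are fault events, "a" and "b" are not.
estimate = {"afb", "afgb", "fab"}
diagnosis, certain, possible = local_diagnosis(estimate, {"f", "g"})
print(sorted(certain))   # ['f']
print(sorted(possible))  # ['g']
```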

A tuple of traces $T = \{s_i \in \Sigma_i^* \mid i \in I\} \in \prod_{i \in I} \Sigma_i^*$ is called a permissive trace tuple of $\mathcal{L}$ if $\|_{i \in I}\, s_i \neq \emptyset$. Since the order of elements in the tuple is not important, here we use the set notation to express the tuple. As an illustration, suppose a system consists of two components A and B, where component A is modeled by a prefix-closed language $L_A = \overline{\{ab, ba\}}$ with $\Sigma_A := \{a, b\}$, and component B is modeled by a prefix-closed language $L_B = \overline{\{cab\}}$ with $\Sigma_B := \{a, b, c\}$. The set of traces $\{a, ca\}$ is a permissive trace tuple because $a \| ca = \{ca\} \neq \emptyset$. This collection corresponds to the scenario in which component A executes the trace $a$ while component B executes $ca$. The set of traces $\{a, c\}$ is not a permissive trace tuple because $a \| c = \emptyset$. This can be interpreted as follows: since event $a$ is shared by both components, execution of $a$ must be done simultaneously in both components. Therefore, component A cannot have executed $a$ when component B has executed only $c$. Similarly, the set $\{a, cab\}$ is not a permissive trace tuple. Let $\Omega(\mathcal{L})$ be the set of all permissive trace tuples of $\mathcal{L}$. Intuitively, a system is captured by the model $\mathcal{L}$ if every possible transition behavior of that system is captured by a permissive trace tuple of $\mathcal{L}$. We formalize this intuition as follows.
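The permissiveness test in the two-component example can be checked mechanically. The sketch below computes the synchronous product of two single strings by enumerating the interleavings in which shared events occur simultaneously; it assumes single-character events and that each string is written over its own alphabet, and the function name is hypothetical.

```python
def sync_strings(s1, s2, sigma1, sigma2):
    """Synchronous product s1 || s2 as a set of strings over sigma1 ∪ sigma2."""
    if not s1 and not s2:
        return {""}
    result = set()
    # An event private to component 1 can occur on its own.
    if s1 and s1[0] not in sigma2:
        result |= {s1[0] + w for w in sync_strings(s1[1:], s2, sigma1, sigma2)}
    # An event private to component 2 can occur on its own.
    if s2 and s2[0] not in sigma1:
        result |= {s2[0] + w for w in sync_strings(s1, s2[1:], sigma1, sigma2)}
    # A shared event must be the next event of both components.
    if s1 and s2 and s1[0] == s2[0]:
        result |= {s1[0] + w for w in sync_strings(s1[1:], s2[1:], sigma1, sigma2)}
    return result


SIGMA_A, SIGMA_B = {"a", "b"}, {"a", "b", "c"}
print(sync_strings("a", "ca", SIGMA_A, SIGMA_B))   # {'ca'}
print(sync_strings("a", "c", SIGMA_A, SIGMA_B))    # set()
print(sync_strings("a", "cab", SIGMA_A, SIGMA_B))  # set()
```

A tuple is permissive exactly when the returned set is non-empty, which matches the three cases discussed above.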

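We first attach time-stamps to every permissive trace tuple. For each $i \in I$ define a projection map $\pi_i : \prod_{j \in I} \Sigma_j^* \to \Sigma_i^*$ such that

$$(\forall S = \{s_j \mid j \in I\} \in \prod_{j \in I} \Sigma_j^*)\ \pi_i(S) := s_i$$

Thus, $\pi_i$ projects a tuple $S$ to the constituent string $s_i$. Let $\preceq$ be a partial order on $\prod_{i \in I} \Sigma_i^*$ such that for any pair $U = \{s_i \mid i \in I\}$, $U' = \{s'_i \mid i \in I\} \in \prod_{i \in I} \Sigma_i^*$,

$$U \preceq U' \iff (\forall i \in I)\ s_i \le s'_i$$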

A subset $\mathcal{U} \subseteq \prod_{j \in I} \Sigma_j^*$ is componentwise prefix-closed if for each $U \in \mathcal{U}$ and $i \in I$,

$$(\forall s \in \Sigma_i^*)\ s \le \pi_i(U) \Rightarrow (\exists U' \in \mathcal{U})\ s = \pi_i(U'),$$

namely, for each constituent string $\pi_i(U)$ of $U$, every prefix substring $s$ of $\pi_i(U)$ is contained in some other tuple $U' \in \mathcal{U}$. Let $\mathbb{R}^+$ be the set of all nonnegative reals. A global evolution of $\mathcal{L}$ is a map $\mathcal{T} : \mathbb{R}^+ \to \Omega(\mathcal{L})$ such that the following conditions hold:

1. $(\forall t, t' \in \mathbb{R}^+)\ t \le t' \Rightarrow \mathcal{T}(t) \preceq \mathcal{T}(t')$

2. $\mathcal{T}(\mathbb{R}^+)$ is componentwise prefix-closed.

$\mathbb{R}^+$ essentially denotes the set of all time instants. Thus, $\mathcal{T}$ attaches to each time instant a specific permissive trace tuple. The resulting sequence of permissive trace tuples describes how the system represented by $\mathcal{L}$ evolves as time elapses. The system may have an infinite number of global evolutions, but each run of the system, from the system being turned on to being turned off, is described by only one global evolution.

Definition 2.2 A local clock for component $L_i$ ($i \in I$) is a strictly monotonically increasing continuous map $T_i : \mathbb{R}^+ \to \mathbb{R}^+$. Two local clocks $T_i$ and $T_j$ are synchronous if $T_i = T_j$. Otherwise, $T_i$ and $T_j$ are asynchronous.

The reason that a local clock is a strictly monotonically increasing continuous map is that the elapse of time is strictly monotonically increasing and continuous. Suppose the global evolution is $\mathcal{T}$. At each time instant $t$ measured by the global clock, for each $i \in I$, the tuple $(\pi_i(\mathcal{T}(t)), T_i(t)) \in \Sigma_i^* \times \mathbb{R}^+$ is called a locally time-stamped local trace with respect to $\mathcal{T}$ and the local clock $T_i$, where $T_i(t)$ is the local measurement of $t$ by the local clock. Correspondingly, the tuple $(P_{i,o}(\pi_i(\mathcal{T}(t))), T_i(t))$ is called a locally time-stamped local symptom of component $L_i$. The corresponding permissive symptom tuple at the global time instant $t$ with respect to $\mathcal{T}$ is

$$\mathcal{O}(t) = \{P_{i,o}(\pi_i(\mathcal{T}(t))) \mid i \in I\} \in \prod_{i \in I} \Sigma_{io}^*$$

Let $\Theta(\mathcal{L}, \mathcal{T}) = \{\mathcal{O}(t) \mid t \in \mathbb{R}^+\}$ be the collection of all permissive symptom tuples of $\mathcal{L}$ with respect to $\mathcal{T}$. Since $\mathcal{T}$ is a global evolution, for any two time instants $t, t' \in \mathbb{R}^+$, we get that

$$t \le t' \Rightarrow \mathcal{O}(t) \preceq \mathcal{O}(t')$$

We can also show that $\{\mathcal{O}(t) \mid t \in \mathbb{R}^+\}$ is componentwise prefix closed, owing to the fact that $\mathcal{T}(\mathbb{R}^+)$ is componentwise prefix closed. Thus, along with the global evolution $\mathcal{T}$, the corresponding set of permissive symptom tuples also forms an evolution-style sequence, which tells how the observable behaviors of the system evolve. We are now ready to describe distributed trace estimation under different assumptions.

3 Distributed Trace Estimation with synchronous local clocks

Each local component is equipped with a trace estimator that can record every locally time-stamped local symptom. A distributed trace estimation process usually (or conveniently) starts with a command issued by a special agent, e.g. either a controller or a local diagnoser which detects an abnormality in the system, telling all local estimators to take action. The content of the command can be simply the value of a time instant, say $t \in \mathbb{R}^+$, which is called a cutoff instant. We assume that a cutoff instant must be a past time instant for each local clock; thus, no ‘future’ behavior is considered for any local component. Once a local estimator receives the cutoff instant $t$, it checks its data log to find the locally time-stamped local symptom $(u_i, t)$. Notice that the cutoff instant $t$ can be thought of as a time instant measured by the global clock, which is then interpreted as a local time instant by every local estimator, because the global clock is unknown to any local estimator. It is these local interpretations that make trace estimation with synchronous local clocks different from the case with asynchronous local clocks. We will see this difference shortly.

The collection of all locally time-stamped local symptoms at the cutoff instant $t$ forms a symptom tuple with local time-stamps $\mathcal{W}(t) = \{(u_i, t) \mid i \in I\} \in \prod_{i \in I} [\Sigma_{io}^* \times \mathbb{R}^+]$. If all local clocks are synchronous, then they are the same map, which means

$$\{u_i = P_{i,o}(\pi_i(\mathcal{T}(T_i^{-1}(t)))) \mid i \in I\} \in \Theta(\mathcal{L}, \mathcal{T})$$

Then the distributed trace estimation consists of two steps:


1. Compute a preliminary trace estimate for $L_i$ ($i \in I$):

$$M_i(t) := P_{i,o}^{-1}(u_i) \cap L_i$$

Here we follow the conservative estimation principle, which requires obtaining all traces that can generate the local symptom $u_i$.

2. Refine preliminary trace estimates by achieving appropriate consistency as follows.

In [8] two types of consistency are introduced. Suppose the preliminary local estimates at the cutoff instant $t$ form a tuple $\mathcal{M}(t) = \{M_i(t) \subseteq L_i \mid i \in I\} \in \prod_{i \in I} 2^{\Sigma_i^*}$. A tuple $\mathcal{E}(t) = \{E_i(t) \subseteq M_i(t) \mid i \in I\} \in \prod_{i \in I} 2^{\Sigma_i^*}$ is globally consistent with respect to $I$ if for all $i \in I$, $E_i(t) = P_{I,i}(\|_{j \in I} E_j(t))$. The concept of global consistency can be interpreted as follows. For each local estimate $E_i(t)$, knowing all other local estimates $E_j(t)$ ($j \neq i$) will not help to further reduce redundant information in $E_i(t)$. Such a globally consistent $\mathcal{E}(t)$ is called a global support of $\mathcal{M}(t)$. Let $\Gamma(\mathcal{M}(t))$ be the set of all global supports of $\mathcal{M}(t)$. Clearly $\Gamma(\mathcal{M}(t))$ is not empty because it contains the trivial support $\{E_i(t) = \emptyset \mid i \in I\}$. We define a partial order $\le$ on the cartesian product $\prod_{i \in I} 2^{\Sigma_i^*}$ as follows: $\{E_i \mid i \in I\} \le \{E'_i \mid i \in I\}$ iff for all $i \in I$, $E_i \subseteq E'_i$. It is shown in [9] that $\Gamma(\mathcal{M}(t))$ equipped with the partial order $\le$ forms a join-lattice [4]. Thus, the greatest element exists, which is called the supremal global support of $\Gamma(\mathcal{M}(t))$, written $\mathrm{Sup}\Gamma(\mathcal{M}(t))$.

Proposition 3.1 [8] $\mathrm{Sup}\Gamma(\mathcal{M}) = \{P_{I,i}(\|_{j \in I} M_j) \mid i \in I\}$.

An algorithm called Computational Procedure for Global Consistency (CPGC) is proposed in [8], which can compute the supremal global support efficiently.

Similarly, we say a tuple $\mathcal{E}(t) = \{E_i(t) \subseteq M_i(t) \mid i \in I\}$ is locally consistent with respect to $I$ if for all $i, j \in I$, $P_{i,j}(E_i(t)) = P_{j,i}(E_j(t))$. Since

$$P_{i,j}(E_i(t)) = P_{j,i}(E_j(t)) \iff E_i(t) = P_{\{i,j\},i}(E_i(t) \| E_j(t)) \wedge E_j(t) = P_{\{i,j\},j}(E_i(t) \| E_j(t))$$

the concept of local consistency can be interpreted as follows: for each local component $L_i$, knowing the local estimate $E_j(t)$ of an adjacent local component $L_j$ (one with $\Sigma_i \cap \Sigma_j \neq \emptyset$) will not refine $E_i(t)$. Local consistency involves only information contained in adjacent neighbors instead of information from all components as required in global consistency. Less information usually leads to coarser estimates in the sense that more irrelevant traces are contained in each local estimate. We call the locally consistent $\mathcal{E}(t)$ a local support of $\mathcal{M}(t)$. Let $\Lambda(\mathcal{M}(t))$ be the set of all local supports of $\mathcal{M}(t)$. $\Lambda(\mathcal{M}(t))$ is not empty because it contains the trivial support $\{E_i(t) = \emptyset \mid i \in I\}$. It is shown in [9] that $\Lambda(\mathcal{M}(t))$ equipped with the partial order $\le$ forms a join-lattice. Thus, the greatest element exists, which is called the supremal local support of $\Lambda(\mathcal{M}(t))$, written $\mathrm{Sup}\Lambda(\mathcal{M}(t))$. A turbo-style algorithm called Computational Procedure for Local Consistency (CPLC) is proposed in [8] to compute the supremal local support.
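The defining condition of local consistency, $P_{i,j}(E_i) = P_{j,i}(E_j)$ for every pair of components, can be checked directly on finite estimates. The sketch below is such a check and nothing more: it is not the CPLC algorithm of [8], it assumes single-character events, the function names are hypothetical, and the sample languages are borrowed from the two-component toy example used in Section 4.

```python
from itertools import combinations


def project(word, sub_alphabet):
    """Natural projection: erase events outside sub_alphabet."""
    return "".join(event for event in word if event in sub_alphabet)


def is_locally_consistent(estimates, alphabets):
    """Check P_{i,j}(E_i) == P_{j,i}(E_j) for every pair of components.

    `estimates` maps a component index to a finite set of traces and
    `alphabets` maps the same index to that component's alphabet.
    """
    for i, j in combinations(sorted(estimates), 2):
        shared = alphabets[i] & alphabets[j]
        seen_by_i = {project(w, shared) for w in estimates[i]}
        seen_by_j = {project(w, shared) for w in estimates[j]}
        if seen_by_i != seen_by_j:
            return False
    return True


alphabets = {1: {"a", "b"}, 2: {"b", "c"}}
print(is_locally_consistent({1: {"abb", "abba"}, 2: {"cbcb"}}, alphabets))               # True
print(is_locally_consistent({1: {"abb", "abba"}, 2: {"cb", "cbc", "cbcb"}}, alphabets))  # False
```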

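The trace estimation procedure with synchronous local clocks is summarized as follows:

1. Compute preliminary local estimates:

$$(\forall i \in I)\ M_i(t) := P_{i,o}^{-1}(u_i) \cap L_i$$

Let $\mathcal{M}(t) = \{M_i(t) \mid i \in I\}$.

2. Achieve global or local consistency:

$$\mathcal{E}(t) := \{E_i(t) \mid i \in I\} = \begin{cases} \mathrm{Sup}\Gamma(\mathcal{M}(t)) & \text{(globally)} \\ \mathrm{Sup}\Lambda(\mathcal{M}(t)) & \text{(locally)} \end{cases}$$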

Proposition 3.2 Suppose the global evolution is $\mathcal{T}$, and all local clocks $\{T_i \mid i \in I\}$ are synchronous. For each time instant $t \in \mathbb{R}^+$, there is a cutoff instant $t' = T_i(t)$ ($i \in I$). Let $\mathcal{W}(t') = \{(u_i, t') \mid i \in I\}$ be the symptom tuple with local time-stamps, $\mathcal{M}(t')$ the tuple of preliminary local estimates based on $\mathcal{W}(t')$ and $\mathcal{E}(t') = \{E_i(t') \mid i \in I\}$ the supremal global support (or supremal local support) of $\mathcal{M}(t')$. Then for each $i \in I$, $\pi_i(\mathcal{T}(t)) \in E_i(t')$.

Prop. 3.2 essentially says that, with synchronous local clocks, each local estimator can always capture the true local trace within its local estimate.

4 Distributed Trace Estimation with asynchronous local clocks

In practical applications it may be technically difficult or financially expensive to make all local clocks perfectly synchronous, especially when the system consists of components located over a geographically wide area, e.g. a national electrical power distribution network or a telecommunication network. As before, suppose the system is $\mathcal{L}$ and a global evolution of $\mathcal{L}$ is $\mathcal{T}$. We now assume that the local clocks $\{T_i \mid i \in I\}$ are asynchronous. Suppose the cutoff instant is $t$ and the corresponding symptom tuple with local time-stamps is $\mathcal{W}(t) = \{(u_i, t) \mid i \in I\}$. By the definition of $\mathcal{W}(t)$, each $u_i$ ($i \in I$) is a local symptom associated with the local trace $\pi_i(\mathcal{T}(T_i^{-1}(t)))$, where $T_i^{-1}(t)$ is the corresponding time instant measured by the global clock and $\mathcal{T}(T_i^{-1}(t))$ is the corresponding permissive trace tuple. Since local clocks are asynchronous, it is possible that there exist $i, j \in I$ such that $T_i^{-1}(t) \neq T_j^{-1}(t)$, namely the corresponding local symptoms $u_i$ and $u_j$ are actually associated with permissive trace tuples at different time instants measured by the global clock. Therefore, if we still apply the approach presented in the previous section, namely computing the preliminary local estimates

$$M_i(t) = P_{i,o}^{-1}(u_i) \cap L_i \quad \text{and} \quad M_j(t) = P_{j,o}^{-1}(u_j) \cap L_j$$

then it is possible that we get $M_i(t) \| M_j(t) = \emptyset$. If that happens, then after communication among local estimators, $E_i(t) = E_j(t) = \emptyset$ when global consistency is used, or when local consistency is used and $\Sigma_i \cap \Sigma_j \neq \emptyset$. Therefore, the local estimators associated with $L_i$ and $L_j$ fail to track the actual traces of $L_i$ and $L_j$ over time.

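As an illustration, suppose a system consists of two local components $L_1 = \overline{\{abba, aaa\}}$ and $L_2 = \overline{\{cbcb, ccc\}}$, where $\Sigma_1 = \{a, b\}$, $\Sigma_2 = \{b, c\}$ and $\Sigma_{1o} = \Sigma_{2o} = \{b\}$. Suppose a global evolution $\mathcal{T}$ is as follows:

$$\mathcal{T}(t) = \begin{cases} \{a, c\} & \text{if } 0 \le t < 1 \\ \{ab, cb\} & \text{if } 1 \le t < 3 \\ \{ab, cbc\} & \text{if } 3 \le t < 4 \\ \{abb, cbcb\} & \text{if } 4 \le t < 6 \\ \{abba, cbcb\} & \text{if } t \ge 6 \end{cases}$$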

The local clock is $T_1(t) = t + 0.1$, namely it is 0.1 time unit faster than the global clock (which is unknown to both local components); and the local clock $T_2(t) = t + 0.5$. Suppose the cutoff instant is $t = 4.3$. Then the locally time-stamped local symptom for $L_1$ is $(bb, 4.3)$, which is associated with the permissive trace tuple

$$\mathcal{T}(T_1^{-1}(4.3)) = \mathcal{T}(4.2) = \{abb, cbcb\}$$

Correspondingly, the locally time-stamped local symptom for $L_2$ is $(b, 4.3)$, which is associated with the permissive trace tuple

$$\mathcal{T}(T_2^{-1}(4.3)) = \mathcal{T}(3.8) = \{ab, cbc\}$$

Clearly, these two local symptoms are not consistent because $b$ is shared by both components and execution of $b$ must be done simultaneously in both components. If we still apply the estimation approach described in the previous section, then after communication we will get $E_1(4.3) = E_2(4.3) = \emptyset$, which suggests that each local estimator fails to track its target component.
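The mismatch in this example can be reproduced numerically. The sketch below encodes the piecewise global evolution and the two clock offsets given above and reads off the local symptom each estimator has logged at the local cutoff instant 4.3. The function names are hypothetical and events are single characters.

```python
def global_evolution(t):
    """Permissive trace tuple (trace of L1, trace of L2) at global time t."""
    if t < 1:
        return ("a", "c")
    if t < 3:
        return ("ab", "cb")
    if t < 4:
        return ("ab", "cbc")
    if t < 6:
        return ("abb", "cbcb")
    return ("abba", "cbcb")


def project(word, sub_alphabet):
    """Natural projection: erase events outside sub_alphabet."""
    return "".join(event for event in word if event in sub_alphabet)


# Inverse local clocks: T1(t) = t + 0.1 and T2(t) = t + 0.5.
INVERSE_CLOCKS = [lambda local: local - 0.1, lambda local: local - 0.5]
OBSERVABLE = {"b"}


def local_symptom(component, cutoff):
    """Symptom logged by the given component (0 or 1) at its local time `cutoff`."""
    global_time = INVERSE_CLOCKS[component](cutoff)
    local_trace = global_evolution(global_time)[component]
    return project(local_trace, OBSERVABLE)


print(local_symptom(0, 4.3))  # bb  (global time 4.2, trace abb)
print(local_symptom(1, 4.3))  # b   (global time 3.8, trace cbc)
```

The two symptoms disagree on the number of occurrences of the shared event `b`, which is exactly why the synchronous procedure returns empty estimates here.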

So far we have seen that time mismatch among local clocks may result in inconsistent local symptoms. But in each local component every locally time-stamped local symptom is actually associated with a permissive trace tuple, the actual trace of the system at some time instant. Suppose the cutoff instant is $t$, which is assumed to be a past time instant for each local clock. Then the global time instants $\{T_i^{-1}(t) \mid i \in I\}$ can be arranged in ascending order. Without loss of generality, suppose such an order is

$$T_1^{-1}(t) \le T_2^{-1}(t) \le \cdots \le T_n^{-1}(t) \quad \text{where } I = \{1, 2, \cdots, n\}$$

By the definition of $\mathcal{T}$, we have

$$\mathcal{T}(T_1^{-1}(t)) \preceq \mathcal{T}(T_2^{-1}(t)) \preceq \cdots \preceq \mathcal{T}(T_n^{-1}(t))$$

which means that for each $i \in I$, the local symptom $u_i$ can be extended to $\pi_i(\mathcal{T}(T_n^{-1}(t)))$ by concatenating an observable string. Therefore, although the locally time-stamped local symptoms $\{(u_i, t) \mid i \in I\}$ may be inconsistent owing to time mismatch of local clocks, $\prod_{i \in I} u_i \Sigma_{io}^*$ must contain local symptoms that are consistent, because $u_i \Sigma_{io}^*$ ($i \in I$) contains every possible local symptom of $L_i$ that contains $u_i$ as a prefix substring (i.e. received after $u_i$), including $\pi_i(\mathcal{T}(T_n^{-1}(t)))$. Suppose the symptom tuple with local time-stamps is $\mathcal{W}(t) = \{(u_i, t) \mid i \in I\}$. We now provide a revised trace estimation procedure:


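1. Compute preliminary local estimates:

$$(\forall i \in I)\ M_i(t) := P_{i,o}^{-1}(u_i \Sigma_{io}^*) \cap L_i = P_{i,o}^{-1}(u_i)\Sigma_i^* \cap L_i$$

Let $\mathcal{M}(t) = \{M_i(t) \mid i \in I\}$.

2. Achieve global or local consistency:

$$\mathcal{A}(t) := \{A_i(t) \mid i \in I\} = \begin{cases} \mathrm{Sup}\Gamma(\mathcal{M}(t)) & \text{(globally)} \\ \mathrm{Sup}\Lambda(\mathcal{M}(t)) & \text{(locally)} \end{cases}$$

3. Compute final local estimates:

$$(\forall i \in I)\ E_i(t) := \overline{A_i(t)} \cap P_{i,o}^{-1}(u_i)$$

Let $\mathcal{E}(t) := \{E_i(t) \mid i \in I\}$.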

Proposition 4.1 For each $t \in \mathbb{R}^+$ there exists a cutoff instant $t' \in \mathbb{R}^+$ such that the local estimates $\mathcal{E}(t') := \{E_i(t') \mid i \in I\}$, computed by the proposed procedure above (under either global consistency or local consistency) based on a symptom tuple $\mathcal{W}(t')$ with the local time-stamp $t'$, have the property that for each $i \in I$, $\pi_i(\mathcal{T}(t)) \in E_i(t')$.

Notice that $\pi_i(\mathcal{T}(t))$ ($i \in I$) stands for the actual trace of component $L_i$ at the global time instant $t$. Thus, Prop. 4.1 essentially says that the actual trace of $L_i$ can always be captured by the corresponding local estimator sooner or later. We can also show the following result.

Proposition 4.2 Given a symptom tuple $O = \{u_i \in \Sigma_{i,o}^* \mid i \in I\}$, suppose $E$ is computed by the algorithm with synchronous local clocks and $\hat{E}$ by the algorithm with asynchronous local clocks, under the same type of consistency (either global or local). Then $E \le \hat{E}$.

Prop. 4.2 says that if local clocks are asynchronous, namely timing information no longer exists, then the quality of trace estimation degrades compared with the case when all local clocks are perfectly synchronous. As an illustration, we now redo the toy example described above. The cutoff instant is still $t = 4.3$. The resulting symptom set with local time-stamps is $W(4.3) = \{bb, b\}$. Then the preliminary local estimates are:

$$M_1(4.3) = P_{1,o}^{-1}(bb)\Sigma_1^* \cap L_1 = \{abb, abba\}$$

$$M_2(4.3) = P_{2,o}^{-1}(b)\Sigma_2^* \cap L_2 = \{cb, cbc, cbcb\}$$

Then communication is used to achieve (global or local) consistency. The result is

$$A(4.3) = \{A_1(4.3) = \{abb, abba\},\ A_2(4.3) = \{cbcb\}\}$$

Finally, the local estimate for $L_1$ is:

$$E_1(4.3) = \overline{A_1(4.3)} \cap P_{1,o}^{-1}(bb) = \{abb, abba\}$$

and the local estimate for $L_2$ is:

$$E_2(4.3) = \overline{A_2(4.3)} \cap P_{2,o}^{-1}(b) = \{cb, cbc\}$$

From those local estimates we can infer that, by the time instant $t = 4.3$ measured by the local clock $T_1$, the trace of the component $L_1$ is either $abb$ or $abba$; similarly, the trace of the component $L_2$ at the local time $t = 4.3$ is either $cb$ or $cbc$.


5 Distributed Trace Estimation with Distorted Observation Channels

Besides the negative impact on the quality of estimation caused by asynchronous local clocks, distorted observation channels can also affect the quality of estimation. In practical applications each observable event in a local component is converted from a measurement collected by a sensor. An observation channel consists of one sensor and the connection between the sensor and the recipient local estimator, which is responsible for carrying measurements from the sensor to the estimator. An observation channel can be distorted in the following ways: (1) the sensor may occasionally fail to generate any measurement; (2) the sensor may generate an inaccurate measurement; (3) the connection has latency that cannot be neglected. Case (1) results in missing observable events, Case (2) leads to incorrect observable events, and Case (3) delays the arrival of an observable event at the estimator. The last case can actually be treated as timing mismatch caused by asynchronous local clocks. So we only consider the first two cases, which can be treated in a uniform way as follows.

For each component $L_i$ ($i \in I$), we define a map

$$f_i : \Sigma_{i,o} \to 2^{\Sigma_{i,o} \cup \{\epsilon\}}$$

such that for each $\sigma \in \Sigma_{i,o}$, $\sigma \in f_i(\sigma)$. For example, $f_i(a) = \{a, b, c, \epsilon\}$ means that the event $a$ may be either lost, as represented by $\epsilon$, or incorrectly converted to event $b$ or $c$ owing to inaccurate measurements from the sensor. Of course, we expect that most of the time $a$ is correctly received as event $a$; thus $a \in f_i(a)$. Informally speaking, $f_i$ describes our prior knowledge about the channel. Next, we define a new map $F_i : \Sigma_i^* \to 2^{\Sigma_{i,o}^*}$ as follows:

1. $F_i(\epsilon) = \{\epsilon\}$

2. $(\forall \sigma \in \Sigma_i)$ $F_i(\sigma) = f_i(\sigma)$ if $\sigma \in \Sigma_{i,o}$, and $F_i(\sigma) = \{\epsilon\}$ otherwise

3. $(\forall s\sigma \in \Sigma_i^*)$ $F_i(s\sigma) = F_i(s)F_i(\sigma)$

We call such an $F_i$ a distortion map. The inverse image function of $F_i$ is $F_i^{-1}$, defined by

$$F_i^{-1} : 2^{\Sigma_{i,o}^*} \to 2^{\Sigma_i^*} : U \mapsto F_i^{-1}(U) := \{u \in \Sigma_i^* \mid F_i(u) \cap U \neq \emptyset\}$$

If $U = \{u\}$, a singleton, we write $F_i^{-1}(u)$ for $F_i^{-1}(\{u\})$.
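As a small worked instance of the definition, assume for illustration a loss-only channel with $f_1(b) = \{\epsilon, b\}$ and $f_1(d) = \{\epsilon, d\}$, where $a$ is unobservable (these values are borrowed from the example later in this section). The distortion map then expands a string symbol by symbol:

```latex
\begin{align*}
F_1(adb) &= F_1(a)\,F_1(d)\,F_1(b) \\
         &= \{\epsilon\}\,\{\epsilon, d\}\,\{\epsilon, b\} \\
         &= \{\epsilon,\ b,\ d,\ db\}
\end{align*}
```

Since $b \in F_1(adb)$, the string $adb$ belongs to $F_1^{-1}(b)$, whereas $ad \notin F_1^{-1}(b)$ because $F_1(ad) = \{\epsilon, d\}$.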

We say the system $L$ is observation distorted with respect to distortion maps $F = \{F_i \mid i \in I\}$ if for each global evolution $T$ and each global time instant $t' \in \mathbb{R}^+$, the locally time-stamped local observation at $T_i(t')$ is $(u_i(T_i(t')), T_i(t'))$, where $u_i(T_i(t')) \in F_i(\pi_i(T(t')))$. As before, the symptom tuple with local time-stamps at a cutoff instant $t \in \mathbb{R}^+$ is the collection of locally time-stamped local observations $W(t) = \{(u_i(t), t) \mid i \in I\}$. We now provide a new trace estimation procedure as follows:


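1. Compute preliminary local estimates:

$$(\forall i \in I)\quad M_i(t) := F_i^{-1}(u_i)\Sigma_i^* \cap L_i$$

2. Communicate among the local estimators to achieve (global or local) consistency, which yields

$$A(t) := \{A_i(t) \mid i \in I\}$$

3. Compute final local estimates:

$$(\forall i \in I)\quad E_i(t) := \overline{A_i(t)} \cap F_i^{-1}(u_i)$$

Let $E(t) := \{E_i(t) \mid i \in I\}$.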

The only difference between the above procedure and the one provided in Section 4 is that here we use the map $F_i$ instead of the natural projection $P_{i,o}$. $F_i^{-1}(u_i)$ contains all strings that may produce the observation $u_i$ owing to the distortion of the observation channel.

Proposition 5.1 Suppose the global evolution is $T$ and $L$ is observation distorted with respect to a set of distortion maps $F = \{F_i \mid i \in I\}$. Then for each $t \in \mathbb{R}^+$ there exists a cutoff instant $t' \in \mathbb{R}^+$ such that the local estimates $E(t') := \{E_i(t') \mid i \in I\}$, computed by the proposed procedure above (under either global consistency or local consistency) based on a symptom tuple $W(t')$ at $t'$, have the property that for each $i \in I$, $\pi_i(T(t)) \in E_i(t')$.

Since $\pi_i(T(t))$ ($i \in I$) stands for the actual trace of component $L_i$ at $t$, Prop. 5.1 essentially says that the actual trace of $L_i$ can always be captured by the estimator sooner or later, if only time mismatch and observation distortion are under consideration.

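We now use a simple example to illustrate the procedure. Suppose a system consists of two components, $L_1 = \overline{\{adbb, ba\}}$ and $L_2 = \overline{\{cbcb\}}$ (the overline denotes prefix closure), where $\Sigma_1 = \{a, b, d\}$, $\Sigma_2 = \{b, c\}$, $\Sigma_{1,o} = \{b, d\}$ and $\Sigma_{2,o} = \{b\}$. Suppose a global evolution $T$ is described as follows:

$$T(t) = \begin{cases}
(a, c) & \text{if } 0 \le t < 1 \\
(ad, c) & \text{if } 1 \le t < 3 \\
(adb, cb) & \text{if } 3 \le t < 4 \\
(adb, cbc) & \text{if } 4 \le t < 6 \\
(adbb, cbcb) & \text{if } t \ge 6
\end{cases}$$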

The local clocks are $T_1(t) = t + 0.1$ and $T_2(t) = t + 0.5$. Suppose the cutoff instant is $t = 6.5$. If the observation channel is flawless, then the locally time-stamped local symptom for $L_1$ should be $(dbb, 6.5)$, which is associated with the permissive trace tuple

$$T(T_1^{-1}(6.5)) = T(6.4) = (adbb, cbcb)$$

But unfortunately, the observable event $d$ and the first $b$ get lost. Thus, the locally time-stamped local symptom becomes $(b, 6.5)$. Suppose the local estimator associated with the component $L_2$ is lucky and no observation gets lost (here the same event $b$ can be measured by two different channels, one for each local estimator). Then the locally time-stamped local symptom for $L_2$ is $(bb, 6.5)$, which is associated with the permissive trace tuple

$$T(T_2^{-1}(6.5)) = T(6.0) = (adbb, cbcb)$$


So the symptom tuple with local time-stamps is $W(t) = \{(b, 6.5), (bb, 6.5)\}$. If we assume the observation channels are flawless, then the preliminary local estimates are:

$$M_1(6.5) = \{b, ba\}$$

$$M_2(6.5) = \{cbcb\}$$

After communication, $A_1(6.5) = A_2(6.5) = \emptyset$. Finally, the local estimates are $E_1(6.5) = E_2(6.5) = \emptyset$. Clearly, each local estimator fails to track the actual trace of its target local component. Now we redo this example under the assumption that the observation channels may be distorted and that both observable events $b$ and $d$ may get lost, namely $f_1(b) = f_2(b) = \{\epsilon, b\}$ and $f_1(d) = \{\epsilon, d\}$. The resulting preliminary local estimates are:

$$M_1(6.5) = \{b, ba, adb, adbb\}$$

$$M_2(6.5) = \{cbcb\}$$

Then communication results in $A_1(6.5) = \{adbb\}$ and $A_2(6.5) = \{cbcb\}$. Finally, the local estimates are $E_1(6.5) = \{adb, adbb\}$ and $E_2(6.5) = \{cbcb\}$. This time, the actual trace of each local component is within the corresponding local estimate.
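The example above can be replayed mechanically. The following is a minimal sketch, not the authors' implementation: it assumes that both component languages are prefix-closed, that the final step intersects the prefix closure of $A_i$ with $F_i^{-1}(u_i)$, and that the communication step for two components can be modelled as agreement on the projection onto the shared event $b$. All function names are hypothetical.

```python
def prefixes(s):
    """All prefixes of s, including the empty string and s itself."""
    return {s[:k] for k in range(len(s) + 1)}

def prefix_closure(lang):
    """Prefix closure of a finite language (empty language stays empty)."""
    return set().union(*(prefixes(s) for s in lang))

def distort(s, f, observable):
    """F_i(s): every observation string that s may produce through the channel."""
    outs = {""}
    for ch in s:
        # Unobservable events contribute nothing; observable ones follow f.
        options = f[ch] if ch in observable else {""}
        outs = {o + x for o in outs for x in options}
    return outs

def preliminary(lang, u, f, observable):
    """M_i: strings of L_i that have a prefix which may have produced u."""
    return {s for s in lang
            if any(u in distort(p, f, observable) for p in prefixes(s))}

def project(s, alphabet):
    """Erase every event of s that is not in the given alphabet."""
    return "".join(ch for ch in s if ch in alphabet)

def make_consistent(m1, m2, shared):
    """Assumed stand-in for the communication step (two components only)."""
    common = {project(s, shared) for s in m1} & {project(s, shared) for s in m2}
    return ({s for s in m1 if project(s, shared) in common},
            {s for s in m2 if project(s, shared) in common})

def final(a, u, f, observable):
    """E_i: prefixes of strings in A_i that may have produced exactly u."""
    return {s for s in prefix_closure(a) if u in distort(s, f, observable)}

def estimate(l1, l2, u1, u2, f1, f2, obs1, obs2, shared):
    """Run the three steps and return (M_1, M_2, E_1, E_2)."""
    m1 = preliminary(l1, u1, f1, obs1)
    m2 = preliminary(l2, u2, f2, obs2)
    a1, a2 = make_consistent(m1, m2, shared)
    return m1, m2, final(a1, u1, f1, obs1), final(a2, u2, f2, obs2)

L1 = prefix_closure({"adbb", "ba"})
L2 = prefix_closure({"cbcb"})
OBS1, OBS2, SHARED = {"b", "d"}, {"b"}, {"b"}

# Channels assumed flawless: every observable event is reported as itself.
flawless = estimate(L1, L2, "b", "bb",
                    {"b": {"b"}, "d": {"d"}}, {"b": {"b"}},
                    OBS1, OBS2, SHARED)

# Channels assumed lossy: b and d may be dropped ("" plays the role of epsilon).
lossy = estimate(L1, L2, "b", "bb",
                 {"b": {"", "b"}, "d": {"", "d"}}, {"b": {"", "b"}},
                 OBS1, OBS2, SHARED)
```

Under these assumptions, `flawless` ends with two empty estimates, while `lossy` recovers estimates that contain the actual traces `adbb` and `cbcb`.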

Proposition 5.2 Given a symptom tuple $O = \{u_i \mid i \in I\}$, suppose $\hat{E}$ is computed by the algorithm with asynchronous local clocks and $\tilde{E}$ by the algorithm with asynchronous local clocks and distorted observation channels, and both tuples of local estimates are under the same type of consistency (either global or local). Then $\hat{E} \le \tilde{E}$.

Prop. 5.2 says that if the channels are distorted, namely observations are no longer accurate, then the quality of trace estimation degrades further compared with the case when only asynchronous local clocks are present. If we put Prop. 4.2 and Prop. 5.2 together, then we have $E \le \hat{E} \le \tilde{E}$, where $E$ is the tuple of local estimates computed with synchronous local clocks. This illustrates how the degradation of the quality of estimation is related to the amount of information obtained by a local estimator.

6 Conclusion

This paper explores the quality of trace estimation under different assumptions about the implementation facilities. It shows that, with synchronous local clocks and flawless observation channels, we can get the best quality of estimation in the proposed distributed framework. If we drop the assumption about the synchronization of local clocks, then the quality of estimation degrades. Nevertheless, local estimators can still track true traces of local components, and become insensitive to the delay of the arrivals of observable events. If we assume that observation channels may be distorted, then the quality of estimation degrades further. But implementation of trace estimation becomes very flexible in the sense that it allows a local estimator to be switched on and off freely. The reason is that a local estimator can treat those local symptoms appearing before it is switched on as lost. At this point we can see a tradeoff between the quality of estimation and the flexibility of implementation.

There are plenty of other application issues related to distributed trace estimation that are not addressed in this paper, e.g. how to carry out communication through channels that have bandwidth limitations, or packet delays and losses. They are part of our ongoing research.

Acknowledgement

This work is supported by the ITEA grant number 04003 (and the Dutch national Senter grant number 18044021) for the EU-ITEA Trust4All Project.

References

[1] R. Debouk, S. Lafortune, and D. Teneketzis. Coordinated decentralized protocols for failure diagnosis of discrete event systems. Discrete Event Dynamic Systems: Theory and Applications, 10(1/2):33–86, January 2000.

[2] E. Fabre, A. Benveniste, and C. Jard. Distributed diagnosis for large discrete event dynamic systems. In Proc. 15th IFAC World Congress, Barcelona, Spain, July 2002.

[3] E. Fabre, A. Benveniste, S. Haar, and C. Jard. Distributed monitoring of concurrent and asynchronous systems. Journal of Discrete Event Dynamic Systems, 15(1):33–84, March 2005.

[4] S. MacLane and G. Birkhoff. Algebra. Chelsea, New York, 1988.

[5] M. Sampath, S. Lafortune, and D. Teneketzis. Active diagnosis of discrete-event systems. IEEE Transactions on Automatic Control, 40:908–929, July 1998.

[6] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis. Failure diagnosis using discrete-event models. IEEE Trans. Control Systems Technology, 4(2):105–124, 1996.

[7] R. Su and W.M. Wonham. Decentralized fault diagnosis for discrete-event systems. In Proc. 2000 CISS, pages TP1:1–6, Princeton, New Jersey, March 2000.

[8] R. Su and W.M. Wonham. Global and local consistencies in distributed fault diagnosis for discrete-event systems. IEEE Transactions on Automatic Control, 50(12):1923–1935, 2005.

[9] Rong Su. Distributed Diagnosis for Discrete-Event Systems. PhD Thesis, ECE Dept., Univ. of Toronto, URL: www.control.utoronto.ca/~surong/SR.zip, 2004.

[10] W. M. Wonham. Supervisory Control of Discrete-Event Systems. Systems Control Group, Dept. of ECE, University of Toronto. URL: www.control.utoronto.ca/DES, 2004.

[11] S. Hashtrudi Zad, R.H. Kwong, and W.M. Wonham. Fault diagnosis in discrete-event systems: Framework and model reduction. IEEE Transactions on Automatic Control, 48(7):1199–1212, 2003.

[12] Y. Zhang and A. Mackworth. Parallel and distributed algorithms for finite constraint satisfaction problems. In Proc. 3rd IEEE Symposium on Parallel and Distributed Processing, pages 394–397, 1991.

[13] Makoto Yokoo, Edmund H. Durfee, Toru Ishida, and Kazuhiro Kuwabara. The distributed constraint satisfaction problem: formalization and algorithms. IEEE Transactions on Knowledge and Data Engineering, 10(5):673–685, 1998.

[14] Z. Collin, R. Dechter, and S. Katz. On the feasibility of distributed constraint satisfaction. In Proc. 12th International Joint Conference on Artificial Intelligence, pages 318–324, 1991.

[15] Gianfranco Lamperti, Marina Zanella, and Paolo Pogliano. Diagnosis of active systems by automata-based reasoning techniques. Applied Intelligence, 12(3):217–237, May 2000.

[16] Yannick Pencole and Marie-Odile Cordier. A formal framework for the decentralised diagnosis of large scale discrete event systems and its application to telecommunication networks. Artificial Intelligence, 164(1-2):121–170, May 2005.

[17] P. Baroni, G. Lamperti, P. Pogliano, and M. Zanella. Diagnosis of large active systems. Artificial Intelligence, 110(1):135–183, May 1999.


Qualitative Domain Abstractions for Time-Varying Systems: an Approach based on Reusable Abstraction Fragments∗

Gianluca Torta and Pietro Torasso
Dipartimento di Informatica, Universita di Torino
C.so Svizzera 185, 10149 Torino (Italy)
torta,[email protected]

Abstract

The present paper addresses the problem of automatic abstraction of system variable domains in time-varying systems, where the domain theory is represented by a set of undirected constraints among qualitative system variables. The main goal is to enable the synthesis of abstract models capable of deriving fewer and more general temporal diagnoses than the ones produced at the detailed level without losing any diagnostic information.

By taking into account the degree of system observability, a formal notion of indiscriminability among behavioral modes of a component in a temporal window is introduced and is used as the starting point for computing reusable fragments of abstractions for the domains of the system variables. The abstraction fragments are associated with applicability conditions defined in terms of constraints on the operating conditions of the system and other contextual information. Once partial information concerning such operating conditions is known for specific (classes of) diagnostic problems, an abstract model is automatically synthesized on the fly by composing the abstraction fragments that are applicable in the considered context.

The paper describes the role of Ordered Binary Decision Diagrams for efficiently computing the abstraction fragments and representing in a compact way their applicability conditions.

1 Introduction

While several approaches to system model abstraction have been developed in the MBD community where the domain expert provides the abstractions (starting from the pioneering work in [Mozetic, 1991] and recent improvements proposed e.g. in [Provan, 2001] and [Chittaro and Ranon, 2004]), some recent proposals have started to deal with automatic abstraction of static system models ([Torta and Torasso, 2003], [Sachenbacher and Struss, 2004]). The driving criterion for performing an abstraction is the level of observability of the system and in particular the impossibility of discriminating among behavioral modes of a component.

In fact, system observability has a deep impact on the number of diagnostic solutions: a low level of observability produces a large number of solutions and in general requires more time for computing the solutions. Using a suitable abstract model for diagnosis can significantly reduce the number of returned diagnoses without missing any relevant information. Moreover, the returned diagnoses, by being more abstract, are usually more understandable for a human.

∗ This research has been partially funded by MIUR under project 2004012477 (2004).

While previous work on automatic abstraction has mainly been made on static models, it is well known that the temporal dimension often plays a major role in diagnostic problem solving, both from a methodological and applicative point of view. This paper addresses the problem of automatic abstraction of time-varying systems. Such models are able to capture the possible evolutions of the behavioral modes of system components over time; they thus have an intermediate expressive power between static models and fully dynamic models, where arbitrary status variables can be modeled¹.

The techniques defined for the automatic abstraction of static models have to be significantly extended when the temporal dimension is taken into consideration: for example, there is the need of defining indiscriminability over time by taking into consideration possible evolutions of the system and the observations available at different time instants. As shown in the field of diagnosability analysis (e.g. [Console et al., 2002], [Cimatti et al., 2003]), the operating conditions of the system play a major role, since different operating conditions (e.g. an engine system may be in operating conditions shut down, warming up, cruising) imply differences in the system behavior so relevant that it is often impossible to draw general conclusions (such as “the system is diagnosable” or “two behavioral modes of a component c are indiscriminable”) that hold independently of them.

One way of dealing with the problem of operating conditions is to analyze the system and to synthesize an abstract model after the operating conditions have been fixed (this is essentially the approach taken in [Torasso and Torta, 2005]). Since operating conditions can (and usually do) change over time, this approach requires the abstraction algorithm to be run on-line every time the operating conditions of the monitored system change².

¹ An in-depth analysis of different types of temporal systems is reported in [Brusoni et al., 1998] where the main characteristics of time-varying systems are singled out.

The main aim of this paper is that of developing a much more flexible approach where most of the effort of computing abstractions is made off-line while the ability of synthesizing on-line the “right” abstraction for the current operating conditions is retained.

In particular, we compute off-line a set of reusable abstraction fragments; each fragment specifies under what operating conditions of the system two behavioral modes of a component can be merged into a single abstract behavioral mode. In order to compute the abstraction fragments, we define a formal notion of indiscriminability among behavioral modes of a component in a temporal window; this notion of indiscriminability takes into account the degree of system observability and constraints on operating conditions.

The computation of abstraction fragments does not require any knowledge about the actual operating conditions of the system and can therefore be performed off-line. As soon as (partial) information concerning the current operating conditions is known, an on-line process efficiently builds a system abstraction by combining suitable abstraction fragments.

The paper is organized as follows. In section 2 we formalize the notions of time-varying system model, diagnostic problem and diagnosis, and in section 3 a precise definition of abstraction mapping is provided based on the notion of temporal indiscriminability. In section 4 we describe how the declarative notions introduced in section 3 can be made operational. Section 5 describes the role of Ordered Binary Decision Diagrams for efficiently computing the abstraction fragments and representing in a compact way their applicability conditions; some preliminary results are reported concerning the application of the approach to a simplified version of the model of a propulsion system. Finally, in section 6 we compare our work to related papers and conclude.

2 Time-Varying Diagnosis

Before presenting our approach to abstraction, it is necessary to precisely define what we mean by diagnosis of time-varying systems. We start from the definition of Time-Varying System Description.

Definition 2.1 A Time-Varying System Description (TVSD) is a tuple $(SV, DT, \delta)$ where:

- $SV$ is the set of discrete system variables, partitioned into $P$ (system ports), $C$ (system components) and $I$ (internal variables). The set of system ports is partitioned into $P_{exo}$ (exogenous ports) and $P_{end}$ (endogenous ports). The set $O \subseteq P_{end}$ represents system observables, while a complete instantiation $S$ of the $C$ variables represents a system status. We will denote with $D(v)$ the finite domain of variable $v \in SV$; in particular, for each $c \in C$, $D(c)$ consists of the list of possible behavioral modes for $c$ (an ok mode and one or more fault modes)

- $DT$ (Domain Theory) is a set of propositional logical formulas over instantiations of variables in $SV$. We require that any instantiation of variables $C \cup P_{exo}$ is consistent with $DT$

- $\delta$ is the System Transition Relation mapping the values of the variables in $P_{exo} \cup C$ at time $t$ to the values of the variables in $C$ at time $t+1$ (such a mapping is usually non deterministic). More precisely, $\delta$ can be obtained as $\delta_1 \bowtie \dots \bowtie \delta_n$³, where $\delta_i$ (Component Transition Relation) represents the possible evolutions of the behavioral modes of component $c_i$. A generic element $\tau_i \in \delta_i$ (Component Transition) is a pair $(X_t \cup \{c_{i,t}(bm)\}, c_{i,t+1}(bm'))$ where $X_t$ is an instantiation of $P_{exo}$ variables at time $t$ and $c_{i,t}(bm)$, $c_{i,t+1}(bm')$ are instantiations of variable $c_i$ at times $t$ and $t+1$ respectively.

² Computing off-line an abstract model for each of the possible operating conditions is in general unfeasible due to their number.

In the following, we will assume that the values $X_t$ of the $P_{exo}$ variables are known at each time point during the diagnosis of the system.

When useful, we will denote a System Transition $\tau = (X_t \cup S_t, S_{t+1})$ as $\tau = S_t \xrightarrow{X_t} S_{t+1}$, where $S_t$, $S_{t+1}$ are system states at times $t$ and $t+1$ respectively. Given a system status $S$ and an instantiation $X$ of the $P_{exo}$ variables, the set of the possible successor states of $S$ given $X$ is defined as

$$next(S, X) = \{S' \text{ s.t. } S \xrightarrow{X} S' \in \delta\}$$

Operator $next$ can be easily extended to apply to a set $\mathcal{S} = \{S_1, \dots, S_m\}$ of system states by taking the union of $next(S_i, X)$, $i = 1, \dots, m$.

Finally, let $\theta = (\tau_0, \dots, \tau_{w-1})$ be a sequence of system transitions s.t. $\tau_i = S_i \xrightarrow{X_i} S_{i+1}$, $i = 0, \dots, w-1$. We say that $\theta$ is a (feasible) system trajectory.

In many practical cases, the evolutions of the behavioral modes in time-varying systems do not depend on the values of the exogenous ports; in such cases a transition can be expressed just as $\tau = (S_t, S_{t+1})$ and a trajectory can be expressed as a sequence $(S_0, \dots, S_w)$.

Moreover, since $\delta$ can be partitioned into $\delta_1, \dots, \delta_n$, we can easily extend the notion of trajectory to that of component trajectory $\theta_c = (\tau_{c,0}, \dots, \tau_{c,w-1})$ for any component $c \in C$.

We are now ready to formalize the notion of Time-Varying Diagnostic Problem.

Definition 2.2 A Time-Varying Diagnostic Problem is a tuple $TVDP = (TVSD, \mathcal{S}_0, w, \sigma)$ where:

- $TVSD$ is a Time-Varying System Description

- $\mathcal{S}_0$ is the set of possible initial states (i.e. system states at time 0)

- $w$ is an integer representing the size of the time window $[0, \dots, w]$ over which the diagnostic problem is defined

- $\sigma$ is a sequence $(X_0, Y_0, \dots, X_w, Y_w)$ where the $X_i$s and the $Y_i$s are instantiations of the $P_{exo}$ and $O$ variables respectively at times $0, \dots, w$. Sequence $\sigma$ then represents the available observed information about the system in time window $[0, \dots, w]$.

Given a time-varying diagnostic problem $TVDP$, we say that a system status $S$ is instantaneously consistent (denoted consistent$_{inst}$) at time $t$, $t \in \{0, \dots, w\}$, if:

$$DT \cup X_t \cup Y_t \cup S \nvdash \bot$$

Instantaneous consistency of $S$ expresses the fact that the instantiation $S$ of $C$ variables is logically consistent with the current values of $P_{exo}$ and $O$ variables, under constraints imposed by $DT$.

Figure 1: The sample Hydraulic System (components PUMP1, PUMP2, PIPE1, PIPE2 and JOIN1; points W, X, Y, Z; control inputs ∆s1, ∆s2; sensor readings ∆fSR1, ∆fSR2, ∆fSRext, ∆rSRW, ∆rSRZ, ∆rSRext).

³ Symbol $\bowtie$ denotes the natural join of two relations.

The following definition formalizes the notions of belief state and temporal diagnosis.

Definition 2.3 Let $TVDP = (TVSD, \mathcal{S}_0, w, \sigma)$ be a Time-Varying Diagnostic Problem. We define the belief state $B_t$ at time $t$ ($t = 0, \dots, w$) recursively as follows:

- $B_0 = \{S_0 \text{ s.t. } S_0 \in \mathcal{S}_0 \text{ and } S_0 \text{ is consistent}_{inst} \text{ at time } 0\}$

- $B_t = \{S_t \text{ s.t. } S_t \in next(B_{t-1}, X_{t-1}) \text{ and } S_t \text{ is consistent}_{inst} \text{ at time } t\}$, $t = 1, \dots, w$

We say that any system status $S_w \in B_w$ is a (time-varying) diagnosis for $TVDP$.

In order for $S_t$ to belong to $B_t$, then, it must be consistent with the current observed information about the system (i.e. $X_t$ and $Y_t$) and with the prediction $next(B_{t-1}, X_{t-1})$ based on the past history.
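The recursion of Definition 2.3 can be sketched in a few lines. The code below is only an illustration under stated assumptions: the one-pump system, its transition table `DELTA` and the `consistent` predicate (a stand-in for the check against $DT$) are hypothetical and are not taken from the hydraulic example of the paper.

```python
def next_states(delta, states, x):
    """next(S, X) lifted to a set of states: the union of all successors."""
    return set().union(*(delta.get((s, x), set()) for s in states))

def belief_states(initial, delta, consistent, sigma):
    """Return [B_0, ..., B_w] for sigma = [(X_0, Y_0), ..., (X_w, Y_w)]."""
    x0, y0 = sigma[0]
    beliefs = [{s for s in initial if consistent(s, x0, y0)}]
    for t in range(1, len(sigma)):
        x_prev = sigma[t - 1][0]
        x_t, y_t = sigma[t]
        # Predict from the previous belief state, then filter by observations.
        predicted = next_states(delta, beliefs[-1], x_prev)
        beliefs.append({s for s in predicted if consistent(s, x_t, y_t)})
    return beliefs

# Hypothetical system: the status is the mode of a single pump, X is the
# deviation of its control signal, Y is the reading of a flow sensor.
MODES = ("ok", "up", "bl")
CONTROLS = ("-", "0", "+")
SUCC = {"ok": {"ok", "up", "bl"}, "up": {"up"}, "bl": {"bl"}}  # permanent faults
DELTA = {(m, x): SUCC[m] for m in MODES for x in CONTROLS}

def consistent(mode, x, y):
    """Toy stand-in for DT: an ok pump reproduces x, a faulty one reads '-'."""
    return y == (x if mode == "ok" else "-")

sigma = [("0", "0"), ("0", "0"), ("0", "-")]
beliefs = belief_states({"ok"}, DELTA, consistent, sigma)
diagnoses = beliefs[-1]
```

In this toy run the belief state stays at `{"ok"}` while the reading is nominal, and the final belief state, i.e. the set of diagnoses, contains both fault modes once the reading drops.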

Example 2.1 As a running example throughout the paper we will consider a simple hydraulic system involving two pumps PUMP1 and PUMP2, each one connected in series with a pipe (PIPE1 and PIPE2 respectively); the two subsystems are connected together via a join JOIN1. The schematics of the hydraulic system is reported in Figure 1.

The Time-Varying System Description $TVSD_H$ of the hydraulic system can be derived by instantiating generic models of pump, pipe and join. The Domain Theory of each component is reported in Figure 2⁴ and makes use of undirected qualitative equations involving variables representing qualitative deviations [Struss et al., 1996]; the Transition Relation $\delta$ can be expressed in terms of the component transition relations reported in Figure 2.

Note that the qualitative equations of the static model can be straightforwardly expressed in propositional logic over multi-valued variables; for example, equation $\Delta f_{out} = \Delta f_{in}$ for Pipe(ok) can be expressed as:

$$Pipe(ok) \wedge [(\Delta f_{out}(-) \wedge \Delta f_{in}(-)) \vee (\Delta f_{out}(0) \wedge \Delta f_{in}(0)) \vee (\Delta f_{out}(+) \wedge \Delta f_{in}(+))]$$

The set of components $C$ consists of PUMP1, PIPE1, PUMP2, PIPE2, JOIN1, FSENS1, FSENS2, RSENSW, RSENSZ, RSENSext, FSENSext, where RSENSW, RSENSZ and RSENSext are sensors reporting resistances while FSENS1, FSENS2 and FSENSext are sensors reporting flows. The positions of the sensors are indicated in Figure 1 and their models are reported in Figure 2.

The pumps have four behavioral modes (ok, overpumping, underpumping, blocked, in the following denoted as ok, op, up, bl respectively) and all the faulty modes represent permanent faults. The pipes have five modes: ok, leaking, broken, partially clogged, clogged, denoted with ok, lk, br, pc, cl. The join and the sensors are supposed to be always ok.

The set of exogenous ports $P_{exo}$ consists of the qualitative deviations $\Delta s_1$ and $\Delta s_2$ associated with the control signals of the two pumps (that influence the quantity of fluid pumped by the pumps). It is worth noting that the values of all of the $P_{exo}$ and $I$ variables belong to $\{-, 0, +\}$; we assume that also the sensor readings concerning flows are at the same level of granularity $\{-, 0, +\}$, while the sensor readings concerning resistances provide a lower level of discrimination, i.e. they discriminate just between nominal and off-nominal.

Pipe(ok)       ∆fout = ∆fin ; ∆rin = ∆rout
Pipe(lk)       ∆fout = ∆fin ⊕ − ; ∆rin = ∆rout ⊕ −
Pipe(pc)       ∆fout = ∆fin ; ∆rin = ∆rout ⊕ +
Pipe(br)       ∆fout = − ; ∆rin = −
Pipe(cl)       ∆fout = − ; ∆rin = +
Pump(ok)       ∆fout = ∆s ⊖ ∆r ⊕ ∆fin
Pump(up)       ∆fout = ∆s ⊖ ∆r ⊕ ∆fin ⊕ −
Pump(op)       ∆fout = ∆s ⊖ ∆r ⊕ ∆fin ⊕ +
Pump(bl)       ∆fout = −
Join(ok)       ∆fout = ∆fin1 ⊕ ∆fin2 ; ∆rout = ∆rin1 = ∆rin2
FSensor(ok)    ∆f = − ⇒ ∆fSR = −
FSensor(ok)    ∆f = 0 ⇒ ∆fSR = 0
FSensor(ok)    ∆f = + ⇒ ∆fSR = +
RSensor(ok)    ∆r = − ⇒ ∆rSR = off-nominal
RSensor(ok)    ∆r = 0 ⇒ ∆rSR = nominal
RSensor(ok)    ∆r = + ⇒ ∆rSR = off-nominal

Transition diagrams: PUMP (modes OK, UP, OP, BL) and PIPE (modes OK, PC, LK, CL, BR).

Figure 2: Sample Components DTs and δs.

⁴ The static models of the components are derived from the ones reported in [Console et al., 2002].

A last remark concerns the need of a relational representation for capturing the domain model. In fact, in many cases we have non determinism in a qualitative equation: let us consider the equation of the pump in the ok mode, when $\Delta s = +$, $\Delta r = +$ and $\Delta f_{in} = 0$; we can have that $\Delta f_{out}$ may be $-$ or $0$ or $+$, so a functional model is not sufficient and a relational one has to be adopted.
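The non determinism mentioned above follows directly from sign arithmetic. The sketch below is a hedged illustration: it assumes that $\oplus$ is the usual qualitative sum over $\{-, 0, +\}$ and that the pump equation combines $\Delta s$ and $\Delta r$ by qualitative subtraction; the function names are hypothetical.

```python
SIGNS = ("-", "0", "+")

def qadd(a, b):
    """Qualitative sum of two signs; returns the set of possible result signs."""
    if a == "0":
        return {b}
    if b == "0":
        return {a}
    if a == b:
        return {a}
    # Opposite signs: the result is ambiguous.
    return set(SIGNS)

def qneg(a):
    """Qualitative negation of a sign."""
    return {"-": "+", "0": "0", "+": "-"}[a]

def qsum(left, right):
    """Qualitative sum lifted to sets of possible signs."""
    return set().union(*(qadd(a, b) for a in left for b in right))

def pump_ok_outputs(ds, dr, dfin):
    """Possible values of dfout for Pump(ok), assuming dfout = ds - dr + dfin."""
    return qsum(qsum({ds}, {qneg(dr)}), {dfin})

ambiguous = pump_ok_outputs("+", "+", "0")   # the case discussed in the text
determined = pump_ok_outputs("+", "0", "0")  # no opposing influence
```

With $\Delta s = +$, $\Delta r = +$ and $\Delta f_{in} = 0$ the two influences oppose each other, so every sign is possible for $\Delta f_{out}$, which is why a relation rather than a function is needed.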

3 Abstractions Defined

As pointed out in the introduction, we drive automatic abstraction of time-varying system models mainly by exploiting system observability and by tailoring abstractions to operating conditions that satisfy certain constraints.


While in the work of [Console et al., 2002] on diagnosability analysis operating conditions are identified by special operating modes of the components, here we define an operating condition as a (partial) instantiation of $P_{exo}$ variables. Constraints on operating conditions thus identify subsets $XS$ of all the instantiations of $P_{exo}$ variables that satisfy the constraints; we denote such subsets as context sets.

3.1 Characterizing Indiscriminability

The following definition, which introduces the notion of instantaneous indiscriminability among two instantiations of a component $c \in C$, is explicitly based on both of the driving factors that we use for abstraction.

Definition 3.1 Let $c \in C$ be a system component and $XS$ be a context set. We say that two instantiations $c(bm)$, $c(bm')$ of $c$ are $XS$-indiscriminable$_{inst}$ iff for any instantiation $C_c$ of $C \setminus \{c\}$ the following holds⁵:

$$\forall X \in XS : \Pi_O(DT \bowtie X \bowtie C_c \bowtie c(bm)) = \Pi_O(DT \bowtie X \bowtie C_c \bowtie c(bm'))$$

The above definition of indiscriminability requires that the values of the observations are exactly the same when the component $c$ is in the behavioral modes $bm$ and $bm'$, and this must hold for any possible instantiation of the behavioral modes of the other components and for any $X \in XS$. Note that the $XS$-indiscriminability$_{inst}$ relation induces a partition into $XS$-indiscriminability$_{inst}$ classes of the set of possible behavioral modes of $c$, i.e. of $D(c)$.
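To make the induced partition concrete, here is a minimal sketch on a deliberately reduced setting, which is an assumption of this illustration and not the system analysed in the paper: a single pipe in isolation, with $\Delta f_{in}$ and $\Delta r_{out}$ playing the role of the context and hypothetical sensors placed directly on $\Delta f_{out}$ and $\Delta r_{in}$. The pipe equations follow the Pipe rows of Figure 2; because there are no other components, the quantification over $C_c$ is trivial here.

```python
from itertools import product

SIGNS = ("-", "0", "+")

def qadd(a, b):
    """Qualitative sum of two signs; returns the set of possible result signs."""
    if a == "0":
        return {b}
    if b == "0":
        return {a}
    return {a} if a == b else set(SIGNS)

def pipe_outputs(mode, dfin, drout):
    """Possible (dfout, drin) pairs of a pipe in the given behavioral mode."""
    if mode == "ok":
        fout, rin = {dfin}, {drout}
    elif mode == "lk":
        fout, rin = qadd(dfin, "-"), qadd(drout, "-")
    elif mode == "pc":
        fout, rin = {dfin}, qadd(drout, "+")
    elif mode == "br":
        fout, rin = {"-"}, {"-"}
    elif mode == "cl":
        fout, rin = {"-"}, {"+"}
    else:
        raise ValueError(mode)
    return set(product(fout, rin))

def observations(mode, dfin, drout):
    """Projection on observables: exact flow sign, coarse resistance reading."""
    return {(f, "nominal" if r == "0" else "off-nominal")
            for f, r in pipe_outputs(mode, dfin, drout)}

def indiscriminable_inst(bm1, bm2, contexts):
    """Same observation sets in every context of the context set."""
    return all(observations(bm1, *x) == observations(bm2, *x) for x in contexts)

def partition(modes, contexts):
    """Group modes into classes; one representative per class is enough
    because the relation is an equivalence."""
    classes = []
    for mode in modes:
        for cls in classes:
            if indiscriminable_inst(mode, next(iter(cls)), contexts):
                cls.add(mode)
                break
        else:
            classes.append({mode})
    return classes

XS = list(product(SIGNS, SIGNS))  # hypothetical context set: all (dfin, drout)
classes = partition(["ok", "lk", "pc", "br", "cl"], XS)
```

In this reduced setting the modes br and cl fall into the same class, because the resistance sensor cannot tell $-$ from $+$, while ok, lk and pc remain singletons.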

We now extend the notion of indiscriminability by taking into account the temporal dimension. For clarity of exposition, from now on we assume that the evolutions of the behavioral modes in the time-varying system do not depend on the exogenous ports, so that a system trajectory can be expressed as $\theta = (S_0, \dots, S_w)$; however, all the discussions that follow can be easily extended to apply to the more general case.

Definition 3.2 Let $c \in C$ be a system component and $XS$ be a context set. We say that two instantiations $c(bm)$, $c(bm')$ of $c$ are $XS$-indiscriminable$_k$, $k \ge 1$, iff:

- given a trajectory $\theta_c = (c(bm_0), \dots, c(bm_k))$ of component $c$ s.t. $bm_t = bm$ for some $t \in \{0, \dots, k\}$, there exists a trajectory $\theta'_c = (c(bm'_0), \dots, c(bm'_k))$ s.t. $bm'_t = bm'$ and $c(bm_i)$, $c(bm'_i)$ are $XS$-indiscriminable$_{inst}$ for $i = 0, \dots, k$

- given a trajectory $\theta'_c = (c(bm'_0), \dots, c(bm'_k))$ of component $c$ s.t. $bm'_t = bm'$ for some $t \in \{0, \dots, k\}$, there exists a trajectory $\theta_c = (c(bm_0), \dots, c(bm_k))$ s.t. $bm_t = bm$ and $c(bm_i)$, $c(bm'_i)$ are $XS$-indiscriminable$_{inst}$ for $i = 0, \dots, k$

It is easy to see that the definition above puts symmetric conditions on the trajectories $\theta_c$ and $\theta'_c$. For each of them it is necessary to find an equivalent trajectory where at each time instant of the temporal window the behavioral modes of the two trajectories are $XS$-indiscriminable$_{inst}$ according to Definition 3.1.
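The two symmetric conditions can be checked by brute force on small models. The sketch below assumes, as the text does, that mode evolution does not depend on the exogenous ports. The transition graphs `PUMP` and `PIPE` and the instantaneous classes passed in are hypothetical: they only respect the statement that faults are permanent and are not the transition relations of Figure 2.

```python
def trajectories(succ, k):
    """All mode sequences (bm_0, ..., bm_k) allowed by the transition graph."""
    paths = [(m,) for m in succ]
    for _ in range(k):
        paths = [p + (n,) for p in paths for n in sorted(succ[p[-1]])]
    return paths

def same_class(classes, a, b):
    """True if a and b are instantaneously indiscriminable (same class)."""
    return any(a in cls and b in cls for cls in classes)

def covered(succ, classes, bm, bm2, k):
    """One direction of the definition: every trajectory passing through bm
    at time t is matched by one passing through bm2 at the same time t."""
    paths = trajectories(succ, k)
    for theta in paths:
        for t, mode in enumerate(theta):
            if mode != bm:
                continue
            matched = any(
                other[t] == bm2
                and all(same_class(classes, x, y) for x, y in zip(theta, other))
                for other in paths
            )
            if not matched:
                return False
    return True

def indiscriminable_k(succ, classes, bm, bm2, k):
    """Both directions of the definition for a window (bm_0, ..., bm_k)."""
    return (covered(succ, classes, bm, bm2, k)
            and covered(succ, classes, bm2, bm, k))

# Hypothetical pump: ok may degrade into any fault, faults are permanent.
PUMP = {"ok": {"ok", "up", "op", "bl"}, "up": {"up"}, "op": {"op"}, "bl": {"bl"}}
PUMP_CLASSES = [{"ok", "up"}, {"op"}, {"bl"}]

# Hypothetical pipe with the same structure.
PIPE = {"ok": {"ok", "lk", "pc", "br", "cl"},
        "lk": {"lk"}, "pc": {"pc"}, "br": {"br"}, "cl": {"cl"}}
PIPE_CLASSES = [{"ok"}, {"lk"}, {"pc"}, {"br", "cl"}]

pump_result = indiscriminable_k(PUMP, PUMP_CLASSES, "ok", "up", 1)
pipe_result = all(indiscriminable_k(PIPE, PIPE_CLASSES, "br", "cl", k)
                  for k in (1, 2, 3))
```

In the hypothetical pump, ok and up share an instantaneous class but are separated over time: the trajectory (ok, op) has no counterpart starting in up, since up can never evolve into a mode indiscriminable from op. In the hypothetical pipe, br and cl stay indiscriminable for every window tried.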

⁵ Symbol $\Pi_S$ denotes the projection of a relation on a set $S$ of variables.

Saying that $c(bm)$, $c(bm')$ are $XS$-indiscriminable$_1$ is equivalent to saying that they are $XS$-indiscriminable$_{inst}$. When $c(bm)$, $c(bm')$ are $XS$-indiscriminable$_k$ for any $k$, we say that they are $XS$-indiscriminable.

Note that also the $XS$-indiscriminability$_k$ relation induces a partition into $XS$-indiscriminability$_k$ classes of $D(c)$.

If we require that the two conditions of the definition above only hold for trajectories where $c(bm)$, $c(bm')$ are the first (last) component status of the trajectory, we obtain a weaker notion of indiscriminability that we denote as $XS$-indiscriminability$^{fut}_k$ ($XS$-indiscriminability$^{past}_k$). It is possible to prove that the following property holds.

Property 3.1 Instantiations $c(bm)$, $c(bm')$ of $c \in C$ are $XS$-indiscriminable$_k$ iff they are both $XS$-indiscriminable$^{fut}_k$ and $XS$-indiscriminable$^{past}_k$.

Example 3.1 Let us consider the hydraulic system of Figure 1 and let us suppose that the constraints on operating conditions allow only the following set $XS$ of instantiations of $P_{exo}$ variables:

∆s1 = − , ∆s2 = +
∆s1 = − , ∆s2 = 0
∆s1 = − , ∆s2 = −

i.e. $\Delta s_1$ is negative. Given this context set, we have e.g. that the behavioral modes ok and up of PUMP1 and the modes br and cl of PIPE1 are $XS$-indiscriminable$_{inst}$.

By taking into consideration the temporal dimension, the indiscriminability among the behavioral modes can decrease. Indeed, it turns out that ok and up of PUMP1 are not $XS$-indiscriminable, while br and cl of PIPE1 are $XS$-indiscriminable$_k$ for any value of $k$, so they are also $XS$-indiscriminable.

3.2 Abstraction Mappings

Let us now introduce the notion of abstraction mapping.

Definition 3.3 Given a Time-Varying System Description $TVSD = (SV, DT, \delta)$ and a context set $XS$, an abstraction mapping $AM$ associates with each component $c_i \in C$ the partition $\Gamma_i$ induced by the XS-indiscriminable relation on $D(c_i)$, i.e.:

$AM(c_i) = \Gamma_i = \{\gamma_{i,1}, \ldots, \gamma_{i,k_i}\}$ where $D(c_i) = \bigcup_{j=1,\ldots,k_i} \gamma_{i,j}$ and $\gamma_{i,j} \cap \gamma_{i,k} = \emptyset$ for $j \neq k$

The abstraction mapping $AM(c_i)$ contains as many elements as the number of equivalence classes induced by the XS-indiscriminability relation on the behavioral modes of component $c_i$. This implies that each new behavioral mode at the abstract level represents the disjunction of the behavioral modes that have been put in an equivalence class $\gamma_{i,j}$.

We can apply an abstraction mapping $AM$ to $TVSD$ in order to get an abstract model $TVSD^A$.

Definition 3.4 Let $TVSD = (SV, DT, \delta)$, $XS$ be a context set and $AM$ be the abstraction mapping induced by $TVSD$ and $XS$. We define the abstraction $TVSD^A = (SV^A, DT^A, \delta^A)$ of $TVSD$ as follows:

- for each $c_i \in C$ s.t. $AM(c_i) = \Gamma_i = \{\gamma_{i,1}, \ldots, \gamma_{i,k_i}\}$, there is a component $c^A_i$ in $C^A$ with domain $\{\nu_{i,1}, \ldots, \nu_{i,k_i}\}$, where $\nu_{i,j}$ is a fresh value whose purpose is to replace class $\gamma_{i,j}$

- each formula $\varphi \in DT$ is replaced by a formula $\varphi^A \in DT^A$ where occurrences of $c_i(bm)$ s.t. $bm \in \gamma_{i,j}$ are substituted by occurrences of $c^A_i(\nu_{i,j})$

- each transition $\tau = (S_t, S_{t+1}) \in \delta$ is replaced by a transition $\tau^A \in \delta^A$ where occurrences of $c_i(bm)$ s.t. $bm \in \gamma_{i,j}$ are substituted by occurrences of $c^A_i(\nu_{i,j})$

Example 3.2 Let us consider as context set the set $XS$ defined in Example 3.1. With this context set, the following pairs of behavioral modes of PIPE1 turn out to be XS-indiscriminable: lk and pc, br and cl. The abstraction mapping $AM$ induced by $XS$ partitions the behavioral modes of PIPE1 into {{ok}, {lk, pc}, {br, cl}}. Applying $AM$ produces an abstraction where PIPE1 is replaced by the abstract component PIPE1$^A$ with three abstract behavioral modes that we can denote as ok$^A$, lk_pc$^A$, br_cl$^A$. Similar abstractions can be derived for PIPE2.
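The partition and the fresh abstract values of Example 3.2 can be illustrated with a minimal Python sketch. It assumes that the XS-indiscriminable pairs are already known (here they are simply listed from the example); the helper names `partition_modes` and `abstract_names` are hypothetical and not part of the paper.

```python
def partition_modes(modes, indiscriminable):
    """Group modes into classes; `indiscriminable(a, b)` is assumed to be
    an equivalence relation, so comparing with one class member suffices."""
    classes = []
    for m in modes:
        for cls in classes:
            if indiscriminable(cls[0], m):
                cls.append(m)
                break
        else:
            classes.append([m])
    return classes


def abstract_names(classes):
    """Map every ground mode to a fresh abstract value naming its class."""
    return {m: "_".join(cls) + "^A" for cls in classes for m in cls}


# Pairs reported as XS-indiscriminable for PIPE1 in Example 3.2.
PAIRS = {frozenset(("lk", "pc")), frozenset(("br", "cl"))}


def same(a, b):
    return a == b or frozenset((a, b)) in PAIRS


classes = partition_modes(["ok", "lk", "pc", "br", "cl"], same)
mapping = abstract_names(classes)
print(classes)        # [['ok'], ['lk', 'pc'], ['br', 'cl']]
print(mapping["pc"])  # lk_pc^A
```

Replacing each occurrence of a ground mode by `mapping[mode]` corresponds to the substitution step of Definition 3.4.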

The abstraction mappings we have defined exhibit an interesting property.

Property 3.2 Let $AM$ and $TVSD^A$ be the abstraction mapping and system description induced by $TVSD$ and $XS$ as defined above. Given a Time-Varying Diagnostic Problem $TVDP = (TVSD, S_0, w, (X_0, Y_0, \ldots, X_w, Y_w))$ s.t. $\forall X_t \in \sigma$, $X_t \in XS$ and the corresponding abstract Time-Varying Diagnostic Problem $TVDP^A = (TVSD^A, S^A_0, w, \sigma)$, $D$ is a diagnosis for $TVDP$ iff its abstraction $D^A$ according to $AM$ (see footnote 6) is a diagnosis for $TVDP^A$.

This property tells us that, if operating conditions are constrained by context set $XS$, we can completely replace $TVSD$ with $TVSD^A$ in our diagnostic reasoning without losing any relevant diagnostic information (this important result mirrors a similar property established for abstractions of static systems in [Torta and Torasso, 2003]).

3.3 Maximal Context Sets

So far we have characterized indiscriminability in terms of a specific context set. We are now interested in computing the maximal context sets that make pairs of behavioral modes $bm$, $bm'$ of component $c$ indiscriminable; such sets represent all the possible operating conditions where abstraction of the modes is admissible without losing diagnostic information.

Definition 3.5 Let $c \in C$ and $bm, bm' \in D(c)$. We define $XS_{max,k}(c, bm, bm')$ to be a context set s.t. $c(bm)$, $c(bm')$ are $XS_{max,k}(c, bm, bm')$-indiscriminable$_k$ and for any $XS$ s.t. $c(bm)$, $c(bm')$ are XS-indiscriminable$_k$ it must be $XS \subseteq XS_{max,k}(c, bm, bm')$.

It is not difficult to prove that for any $c \in C$, $bm, bm' \in D(c)$ and $k$ there always exists $XS_{max,k}(c, bm, bm')$ satisfying the definition above. As two special cases, there exist $XS^{inst}_{max}(c, bm, bm')$ (i.e. $k = 1$) and $XS_{max}(c, bm, bm')$ (i.e. $k$ unbounded). Furthermore, if we replace the XS-indiscriminable$_k$ relation in the definition above with the XS-indiscriminable$^{fut}_k$ (XS-indiscriminable$^{past}_k$) relation, we obtain a definition of $XS^{fut}_{max,k}(c, bm, bm')$ ($XS^{past}_{max,k}(c, bm, bm')$).

6 $S^A_0$ and $D^A$ are obtained from $S_0$ and $D$ with substitutions like the ones of Definition 3.4.

4 Computing Abstractions

In the previous section we have provided a declarative characterization of abstraction mappings, while in this section we describe a method for the actual computation of such mappings. As stated in the introduction, our approach consists of two parts: the first one produces a set of reusable abstraction fragments off-line and the second one computes on-line an abstraction mapping as soon as partial knowledge on the operating conditions of the system is provided.

Off-line Computation of Abstraction Fragments. This process involves the following steps:

- computation of instantaneous maximal context sets $XS^{inst}_{max}(c, bm, bm')$

- computation of $XS^{fut}_{max}(c, bm, bm')$ and $XS^{past}_{max}(c, bm, bm')$ by exploiting sets $XS^{inst}_{max}(c, bm, bm')$ and Transition Relation $\delta$

- computation of $XS_{max}(c, bm, bm')$ as the intersection of $XS^{fut}_{max}(c, bm, bm')$ and $XS^{past}_{max}(c, bm, bm')$ (as a direct consequence of Property 3.1)

Instantaneous maximal context sets $XS^{inst}_{max}(c, bm, bm')$ can be computed with the algorithm shown in Figure 3 (see footnote 7). Because of lack of space we do not describe the algorithm in full detail; instead, we discuss the main ideas underlying its design.

According to Definition 3.1, an instantiation $X$ of $P_{exo}$ variables belongs to $XS^{inst}_{max}(c, bm, bm')$ iff for each instantiation $C_c$ of $C \setminus \{c\}$:

$\Pi_O(DT \bowtie X \bowtie C_c \bowtie c(bm)) = \Pi_O(DT \bowtie X \bowtie C_c \bowtie c(bm'))$

Note that, due to non-determinism, both sides of the equation above can be sets, i.e. even if we focus on a particular instantiation of $C$ and $P_{exo}$ variables, there may be several consistent instantiations of the $O$ variables.

A straightforward way to compute $XS^{inst}_{max}(c, bm, bm')$ would be to consider each instantiation $X$ of $P_{exo}$ and then, for each instantiation $C_c$ of $C \setminus \{c\}$, to check whether the equation above holds. Such a naïve implementation would however be prohibitively costly; in our algorithm, for efficiency reasons (that will become clearer in Section 5), we completely avoid the explicit enumeration of the instantiations of subsets of variables as well as the enumeration of the tuples of the relations.
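For illustration only, the naïve enumeration mentioned above can be sketched in Python. This is not the OBDD-based algorithm of the paper: it assumes a hypothetical explicit encoding of the Domain Theory as tuples `(x, cc, mode, obs)`, where `x` is an instantiation of the exogenous variables, `cc` an instantiation of the other components, `mode` the behavioral mode of the component under study and `obs` a consistent instantiation of the observables. The toy data are invented.

```python
from collections import defaultdict


def naive_xs_inst_max(dt, bm, bm2):
    """Return the contexts x in which modes bm and bm2 yield the same set of
    observations for every instantiation cc of the other components."""
    obs = defaultdict(set)            # (x, cc, mode) -> consistent observations
    contexts, others = set(), set()
    for x, cc, mode, o in dt:
        obs[(x, cc, mode)].add(o)
        contexts.add(x)
        others.add(cc)
    return {
        x for x in contexts
        if all(obs.get((x, cc, bm), set()) == obs.get((x, cc, bm2), set())
               for cc in others)
    }


# Hypothetical toy relation: modes 'a' and 'b' look alike only in context '-'.
DT = [
    ("-", "ok", "a", "low"), ("-", "ok", "b", "low"),
    ("+", "ok", "a", "high"), ("+", "ok", "b", "low"),
]
result = naive_xs_inst_max(DT, "a", "b")
print(result)  # {'-'}
```

The nested enumeration over contexts and over instantiations of the other components is exactly what becomes prohibitive on realistic models.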

In lines 1 through 9 the algorithm (making use of four copies $O_{bm}$, $O_{bm'}$, $O^*_{bm}$ and $O^*_{bm'}$ of the observable variables $O$) builds two relations $DT^{\rightarrow}_{bm,bm'}$ and $DT^{\leftarrow}_{bm,bm'}$. A tuple in relation $DT^{\rightarrow}_{bm,bm'}$ associates an instantiation $X$ of $P_{exo}$ and $C_c$ of $C \setminus \{c\}$ with an instantiation of the copies

7 Symbol $\rho_{S \rightarrow T}$ denotes the renaming of variables in $S$ with the corresponding variables in $T$.


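1  $DT = \Pi_{P_{exo} \cup C \cup O}(DT)$
2  $DT_{bm} = (\rho_{O \rightarrow O_{bm}} DT) \bowtie c(bm)$
3  $DT_{bm} = DT_{bm} \bowtie equiv(O_{bm}, O^*_{bm})$
4  $DT_{bm'} = (\rho_{O \rightarrow O_{bm'}} DT) \bowtie c(bm')$
5  $DT_{bm'} = DT_{bm'} \bowtie equiv(O_{bm'}, O^*_{bm'})$
6  $DT_{bm,bm'} = DT_{bm} \bowtie DT_{bm'}$
7  $DT^{\rightarrow}_{bm,bm'} = \Pi_{P_{exo} \cup C \cup O_{bm} \cup O_{bm'}}(DT_{bm,bm'})$
8  $DT^{\leftarrow}_{bm,bm'} = \Pi_{P_{exo} \cup C \cup O^*_{bm} \cup O^*_{bm'}}(DT_{bm,bm'})$
9  $DT^{\leftarrow}_{bm,bm'} = \rho_{O^*_{bm'} \rightarrow O_{bm},\, O^*_{bm} \rightarrow O_{bm'}} DT^{R}_{bm,bm'}$
10 $XS_{diff} = \Pi_{P_{exo}}(DT^{\rightarrow}_{bm,bm'} \setminus (DT^{\rightarrow}_{bm,bm'} \cap DT^{\leftarrow}_{bm,bm'}))$
11 $XS_{max}(c, bm, bm') = \Pi_{P_{exo}}(DT^{\rightarrow}_{bm,bm'}) \setminus XS_{diff}$

Figure 3: Computation of $XS^{inst}_{max}(c, bm, bm')$.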

$O_{bm}$ and $O_{bm'}$ of the observables $O$. In particular, the instantiation of $O_{bm}$ must be consistent with $X$, $P_{exo}$ and $c(bm)$ while the instantiation of $O_{bm'}$ must be consistent with $X$, $P_{exo}$ and $c(bm')$.

As for relation $DT^{\leftarrow}_{bm,bm'}$, it is analogous to $DT^{\rightarrow}_{bm,bm'}$ but somewhat specular: given one of its tuples where $P_{exo}$ and $C \setminus \{c\}$ have values $X$ and $C_c$ respectively, the instantiation of $O_{bm}$ is consistent with $X$, $P_{exo}$ and $c(bm')$ (instead of $c(bm)$) while the instantiation of $O_{bm'}$ is consistent with $X$, $P_{exo}$ and $c(bm)$ (instead of $c(bm')$).

It is not difficult to see that the instantiations $X$ of $P_{exo}$ that belong to $XS^{inst}_{max}(c, bm, bm')$ are exactly the ones for which the associated restrictions of $DT^{\leftarrow}_{bm,bm'}$ and $DT^{\rightarrow}_{bm,bm'}$ (i.e. the sets of tuples where $X$ appears) are identical. Lines 10 and 11 of the algorithm extract these instantiations and store them in the result $XS^{inst}_{max}(c, bm, bm')$.

As stated above, maximal context sets $XS_{max}(c, bm, bm')$ can be computed from sets $XS^{inst}_{max}(c, bm, bm')$ by taking into consideration the Transition Relation $\delta$. Figure 4 shows how $XS^{fut}_{max}(c, bm, bm')$ is computed. Relation $XS^{past}_{max}(c, bm, bm')$ is computed in a similar way, and then the two are intersected to obtain $XS_{max}(c, bm, bm')$.

The while loop is intended to incrementally extend the size of the time window over which the maximal context set $XS_{bm,bm'}$ is evaluated (at iteration 0, outside the loop, such a maximal context set is just $XS^{inst}_{max}(c, bm, bm')$). In particular, at iteration $k$ we consider the sets $B_{bm}$ and $B_{bm'}$ of behavioral modes reachable in $k$ steps from $bm$ and $bm'$ respectively; we exit the loop after the first iteration where neither $B_{bm}$ nor $B_{bm'}$ has changed (see footnote 8).

At each iteration, following Definition 3.2, we discard from $XS_{bm,bm'}$ the assignments to $P_{exo}$ variables for which at least one behavioral mode in $B_{bm}$ is instantaneously discriminable from all the behavioral modes in $B_{bm'}$ or vice-versa (lines 5-7). It is easy to see that an instantiation $X$ of $P_{exo}$ belongs to $XS_{bm}$ only if, for all $bm$ in $B_{bm}$, we have been able to find at least one $bm'$ in $B_{bm'}$ s.t. $bm$, $bm'$ are instantaneously indiscriminable in context $X$; $XS_{bm'}$ is analogous (but inverts the roles of $B_{bm}$ and $B_{bm'}$).

Lines 8, 9 conclude the loop body by updating $B_{bm}$ and

8 This check for a fixed point is valid under the assumption that $B_{bm}$ and $B_{bm'}$ are monotonically non-decreasing, which is guaranteed when each behavioral mode has at least a transition to itself.

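1  $B^{old}_{bm} = \{bm\}$, $B^{old}_{bm'} = \{bm'\}$
2  $XS_{bm,bm'} = XS^{inst}_{max}(c, bm, bm')$
3  $B_{bm} = next(B^{old}_{bm})$, $B_{bm'} = next(B^{old}_{bm'})$
4  while $B_{bm} \neq B^{old}_{bm} \vee B_{bm'} \neq B^{old}_{bm'}$
5    $XS_{bm} = \bigcap_{bm \in B_{bm}} \bigcup_{bm' \in B_{bm'}} XS^{inst}_{max}(c, bm, bm')$
6    $XS_{bm'} = \bigcap_{bm' \in B_{bm'}} \bigcup_{bm \in B_{bm}} XS^{inst}_{max}(c, bm, bm')$
7    $XS_{bm,bm'} = XS_{bm,bm'} \cap XS_{bm} \cap XS_{bm'}$
8    $B^{old}_{bm} = B_{bm}$, $B^{old}_{bm'} = B_{bm'}$
9    $B_{bm} = next(B^{old}_{bm})$, $B_{bm'} = next(B^{old}_{bm'})$
10 $XS^{fut}_{max}(c, bm, bm') = XS_{bm,bm'}$

Figure 4: Computation of $XS^{fut}_{max}(c, bm, bm')$.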
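The loop of Figure 4 can be sketched in Python on plain sets. This is only an illustration under stated assumptions: the paper manipulates OBDD-encoded relations, whereas here `xs_inst(a, b)` is a hypothetical lookup returning the instantaneous maximal context set as a Python set, and `next_modes(B)` returns the modes reachable in one step from the set `B` (assumed to include self-loops). The toy transition table and context sets are invented.

```python
def xs_fut_max(bm, bm2, next_modes, xs_inst):
    """Fixed-point computation following the structure of Figure 4."""
    b_old, b2_old = {bm}, {bm2}
    xs = set(xs_inst(bm, bm2))                       # iteration 0
    b, b2 = next_modes(b_old), next_modes(b2_old)
    while b != b_old or b2 != b2_old:
        # every mode reachable from bm needs an indiscriminable partner, and vice versa
        xs_b = set.intersection(*[set().union(*[xs_inst(m, m2) for m2 in b2]) for m in b])
        xs_b2 = set.intersection(*[set().union(*[xs_inst(m, m2) for m in b]) for m2 in b2])
        xs &= xs_b & xs_b2
        b_old, b2_old = b, b2
        b, b2 = next_modes(b_old), next_modes(b2_old)
    return xs


# Hypothetical toy model: mode 'a' may degrade to 'c'; 'b' and 'c' are absorbing.
ALL = {"-", "0", "+"}
TRANS = {"a": {"a", "c"}, "b": {"b"}, "c": {"c"}}
INST = {frozenset("ab"): {"-", "0"}, frozenset("cb"): {"-"}, frozenset("ac"): set()}


def xs_inst(m, m2):
    return ALL if m == m2 else INST[frozenset((m, m2))]


def next_modes(modes):
    return set().union(*(TRANS[m] for m in modes))


fut = xs_fut_max("a", "b", next_modes, xs_inst)
print(fut)  # {'-'}
```

In the toy model, context `0` is dropped because mode `c`, reachable from `a`, can be told apart from `b` in that context.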

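ComputeAM($TVSD$, $XS$, [$XS_{max}(c, bm, bm')$])
  $AM = \emptyset$, $n = |C|$
  $\Gamma_i = \emptyset$ ($i = 1, \ldots, n$)
  $\gamma_{i,j} = \{bm_j\}$ ($i = 1, \ldots, n$, $j = 1, \ldots, |D(c_i)|$)
  foreach $c_i \in C$
    foreach $j \in \{1, \ldots, |D(c_i)|\}$ s.t. $\gamma_{i,j} \neq \emptyset$
      foreach $k \in \{(j + 1), \ldots, |D(c_i)|\}$ s.t. $\gamma_{i,k} \neq \emptyset$
        if ($XS \subseteq XS_{max}(c_i, bm_j, bm_k)$)
          $\gamma_{i,j} = \gamma_{i,j} \cup \{bm_k\}$
          $\gamma_{i,k} = \emptyset$
      $\Gamma_i = \Gamma_i \cup \{\gamma_{i,j}\}$
    $AM = AM \cup \{(c_i, \Gamma_i)\}$
End

Figure 5: Sketch of the ComputeAM() Algorithm.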

$B_{bm'}$.

On-line Computation of Abstraction Mappings. Let [$XS_{max}(c, bm, bm')$] be the data structure containing $XS_{max}(c, bm, bm')$ for each $c \in C$ and $bm, bm' \in D(c)$ which has been computed off-line. We are now in the position of exploiting it to compute an abstraction mapping $AM$ when a specific context set $XS$ is given; such a context set represents partial run-time knowledge about operating conditions.

The algorithm for the computation of $AM$ is shown in Figure 5. Essentially, it considers each pair of behavioral modes $bm_j$, $bm_k$ of each component $c_i$ and checks whether the given context set is a subset of the maximal context set associated with $bm_j$, $bm_k$; if yes, $AM$ puts $bm_j$ and $bm_k$ in the same class of the partition of $D(c_i)$ associated with $c_i$.
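A minimal Python sketch of the ComputeAM() procedure of Figure 5 follows. It assumes that context sets are plain Python sets rather than OBDDs and that `xs_max(c, a, b)` is a hypothetical lookup into the precomputed data structure; the valve component, its modes and its context sets are invented for illustration.

```python
def compute_am(domains, xs, xs_max):
    """domains: {component: [modes]}; xs: given context set;
    xs_max(c, a, b): precomputed maximal context set for modes a, b of c."""
    am = {}
    for c, modes in domains.items():
        classes = [{m} for m in modes]            # gamma_{i,j} = {bm_j}
        gamma = []
        for j in range(len(modes)):
            if not classes[j]:                    # already merged into an earlier class
                continue
            for k in range(j + 1, len(modes)):
                if classes[k] and xs <= xs_max(c, modes[j], modes[k]):
                    classes[j].add(modes[k])
                    classes[k] = set()
            gamma.append(frozenset(classes[j]))
        am[c] = gamma
    return am


# Hypothetical fragments: under an "open" command, ok and stuck_open look alike.
TABLE = {
    frozenset(("ok", "stuck_open")): {"open_cmd"},
    frozenset(("ok", "stuck_closed")): {"close_cmd"},
}


def xs_max(c, a, b):
    return TABLE.get(frozenset((a, b)), set())


am = compute_am({"valve": ["ok", "stuck_open", "stuck_closed"]}, {"open_cmd"}, xs_max)
print(am["valve"])
```

With the context restricted to `open_cmd`, the sketch merges `ok` and `stuck_open` into one class and leaves `stuck_closed` on its own.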

Example 4.1 In Example 3.2 we have shown a context set $XS$ for which modes lk and pc of PIPE1 turned out to be XS-indiscriminable. The maximal context set for which the two modes are XS-indiscriminable is actually larger and includes the following instantiations:

∆s1 = − , ∆s2 = +
∆s1 = − , ∆s2 = 0
∆s1 = − , ∆s2 = −
∆s1 = 0 , ∆s2 = −
∆s1 = + , ∆s2 = −

This set is also the maximal context set for which br and cl of PIPE1 are XS-indiscriminable. By contrast, the maximal context set $XS_{max}(PUMP1, up, op)$ for which up and op of PUMP1 are XS-indiscriminable contains just one instantiation, where ∆s1 = + and ∆s2 = −.

This is the kind of processing that can be performed off-line: it provides the basis for determining the abstraction fragments. During diagnostic reasoning, abstraction fragments are actually used for synthesizing on the fly an abstract model to be used for performing diagnosis in a temporal window.

Let us suppose that we know that in a given time interval we will have ∆s1 = −. This partial information on the operating conditions is sufficient for activating the ComputeAM() algorithm where $XS$ is:

∆s1 = − , ∆s2 = −
∆s1 = − , ∆s2 = 0
∆s1 = − , ∆s2 = +

(which corresponds to the constraint ∆s1 = −).

It is easy to see that $XS$ is a subset of the maximal context set reported above for modes lk, pc and br, cl of PIPE1, so a non-trivial abstraction mapping can be derived for PIPE1 (and PIPE2); on the contrary, an identity mapping is derived for PUMP1 (and PUMP2) since $XS$ is not included in the maximal context set $XS_{max}(PUMP1, up, op)$.

This means that, as long as ∆s1 = −, the diagnostic system can safely use the abstract system description $TVSD^A$ instead of the original one $TVSD$. While $TVSD^A$ and $TVSD$ coincide with respect to the models of PUMP1, PUMP2 and JOIN, components PIPE1 and PIPE2 are replaced by the abstract components PIPE1$^A$ and PIPE2$^A$ that have been described in Example 3.2.
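The inclusion checks of Example 4.1 can be replayed with a few lines of Python. The sets below are transcribed from the example as pairs (∆s1, ∆s2); the variable names are hypothetical, and plain Python sets stand in for the OBDDs used in the paper.

```python
MINUS, ZERO, PLUS = "-", "0", "+"

# Maximal context set for lk/pc (and for br/cl) of PIPE1, as pairs (ds1, ds2).
XS_MAX_PIPE1 = {
    (MINUS, PLUS), (MINUS, ZERO), (MINUS, MINUS), (ZERO, MINUS), (PLUS, MINUS),
}
# Maximal context set for up/op of PUMP1: a single instantiation.
XS_MAX_PUMP1_UP_OP = {(PLUS, MINUS)}

# Run-time knowledge: ds1 is negative, ds2 is unconstrained.
XS = {(MINUS, ds2) for ds2 in (MINUS, ZERO, PLUS)}

print(XS <= XS_MAX_PIPE1)        # True: lk/pc and br/cl can be merged
print(XS <= XS_MAX_PUMP1_UP_OP)  # False: up and op must stay distinct
```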

5 Implementation with OBDDs

The algorithms reported in Figures 3, 4 and 5 are given in terms of relational operators. The complexity of the algorithms strongly depends on the number of tuples occurring in the relations which constitute their inputs and (intermediate) results.

However, let us first consider the number of operations independently of the sizes of the involved relations. The algorithm in Figure 3 performs a linear (w.r.t. the number of system variables $|SV|$) number of operations on the relations it produces starting from $DT$; the algorithm in Figure 4 performs $O(|D(c)|^3)$ operations on relations which are intersections of $XS^{inst}_{max}(c, bm, bm')$; finally, the algorithm in Figure 5 performs $O(n \cdot |D_{max}|^2)$ checks for the inclusion of $XS$ in $XS_{max}(c, bm_j, bm_k)$ (see footnote 9).

This analysis shows that the complexity of the algorithms w.r.t. the number of relational operations is very low. Therefore, the sizes of the relations are the critical issue.

Since the sizes of the involved relations can in general be very large, we have adopted OBDDs for encoding and manipulating all the relations involved in the algorithms (including the Domain Theory $DT$); the choice of OBDDs has been dictated by their well-known ability to compactly represent huge state spaces in many practical cases (e.g. [Bertoli et al., 2001], [Jensen and Veloso, 1999]), even if there is no theoretical guarantee that the encoding of a relation yields a small OBDD.

9 Where $n$ is the number of components and $D_{max}$ is the largest domain of a component.

As concerns the computation of the Abstraction Mapping,a significant complexity result holds.

Property 5.1 Let the sizes of the OBDDs encoding context set $XS$ and all maximal context sets $XS_{max}(c, bm, bm')$ be smaller than some limit $T$. Then the complexity of ComputeAM() is $O(n \cdot |D_{max}|^2 \cdot T^2)$ where $n$ is the number of components in the system.

This result shows that the computation of the Abstraction Mapping can be performed in linear time w.r.t. the number of components, so it can be performed on-line during the diagnostic reasoning.

It is also important to note that the result strongly depends on a property of OBDDs ensuring that each check for inclusion of $XS$ in $XS_{max}(c, bm_j, bm_k)$ (which can be implemented as an intersection and a test for equivalence) takes time proportional to the product of the sizes of the OBDDs representing $XS$ and $XS_{max}(c, bm_j, bm_k)$.

We have performed a preliminary test of our approach on a simplified version of the propulsion system used in [Kurien and Nayak, 2000]. In particular, the considered model is similar to the one used in [Torasso and Torta, 2005]: we have 33 components (involving 3 tanks, 17 valves, 7 splits, 4 joins, and 2 engines); the static part of the model requires 90 multiple-valued variables, while the dynamic part involves 33 additional variables. Set $O$ consists of 5 observable variables describing flows (in terms of nominal flow, reduced flow and no flow) and 2 observable variables reporting the thrust of the engines; $P_{exo}$ involves 17 variables which represent the commands sent to the corresponding valves.

The encoding of the model of the propulsion system with OBDDs has produced a quite compact representation (the OBDD representing the domain theory has a size of 3806 nodes). The adoption of OBDDs has proved essential for computing and encoding maximal context sets $XS_{max}(c, bm, bm')$. In fact, the largest OBDD representing an $XS_{max}(c, bm, bm')$ involves just 96 nodes, while the number of tuples (each one representing an instantiation of the 17 variables of $P_{exo}$) contained in $XS_{max}(c, bm, bm')$ is up to 128,768.

The compact representation of the structures used in the algorithms of Figures 3 and 4 has the further advantage of allowing an efficient computation of $XS_{max}(c, bm, bm')$: in fact, the computation of all the abstraction fragments takes 54,289 msec of CPU time on a laptop machine equipped with a Centrino CPU at 1.4GHz and 512MB RAM. This CPU time is quite good (less than one minute) if we consider that in the propulsion domain we have to compute 175 $XS_{max}(c, bm, bm')$ and that this operation can be done off-line.

As expected (Property 5.1), the on-line computation of the abstraction mapping $AM$ given a context set $XS$ is very efficient, since it takes less than 5 msec of CPU time by applying the algorithm reported in Figure 5.

6 Conclusions

So far, only a few methods have been proposed for automatic model abstraction in the context of MBD and most of them deal with static system models. Relevant examples are the methods described in [Sachenbacher and Struss, 2004] and in [Torta and Torasso, 2003], which guarantee that the abstract models they build preserve all the relevant diagnostic information, i.e. there is no loss of discrimination power when using the abstract model, so that the ground model can potentially be substituted by the abstract one.

A first proposal of automatic abstraction of time-varying systems is described in [Torasso and Torta, 2005], where system abstractions are computed after the operating conditions of the system have been completely specified.

The present paper addresses the problem of automatic abstraction of time-varying systems for a class of system models strictly larger than the one considered in [Torasso and Torta, 2005], since no assumptions about directionality of the model are made, i.e. the Domain Theory is a relation among system variables.

More importantly, the abstractions are not built from scratch for each specific assignment to the variables representing operating conditions but are obtained by combining suitable abstraction fragments that have been precomputed off-line.

The formal and experimental analysis of the approach has pointed out some relevant properties. First of all, like [Sachenbacher and Struss, 2004] and [Torta and Torasso, 2003], we guarantee that the information relevant for diagnosis is preserved and that it is possible to convert back and forth between detailed and abstract diagnoses (Property 3.2).

Second, if abstraction fragments are encoded with OBDDs of limited size, the on-line synthesis of abstraction mappings from the fragments is guaranteed to be efficient (Property 5.1). Finally, the adoption of OBDDs for computing and encoding the abstraction fragments has the potential of making the off-line computation of fragments much more efficient (both in space and time) than a direct representation of the possibly huge involved relations; a preliminary verification of this claim has been provided by testing the approach on the model of a propulsion system, as shown in Section 5.

References

[Bertoli et al., 2001] P. Bertoli, A. Cimatti, M. Roveri, and P. Traverso. Planning in nondeterministic domains under partial observability via symbolic model checking. In Proc. IJCAI, pages 473–478, 2001.

[Brusoni et al., 1998] V. Brusoni, L. Console, P. Terenziani, and D. Theseider Dupre. A spectrum of definitions for temporal model-based diagnosis. Artificial Intelligence, 102:39–79, 1998.

[Chittaro and Ranon, 2004] L. Chittaro and R. Ranon. Hierarchical model-based diagnosis based on structural abstraction. Artificial Intelligence, 155(1–2):147–182, 2004.

[Cimatti et al., 2003] A. Cimatti, C. Pecheur, and R. Cavada. Formal verification of diagnosability via symbolic model checking. In Proc. IJCAI, pages 363–369, 2003.

[Console et al., 2002] L. Console, C. Picardi, and M. Ribaudo. Process algebras for system diagnosis. Artificial Intelligence, 142(1):19–51, 2002.

[Jensen and Veloso, 1999] R. M. Jensen and M. M. Veloso. OBDD-based universal planning: Specifying and solving planning problems for synchronized agents in nondeterministic domains. LNCS, 1600:213–248, 1999.

[Kurien and Nayak, 2000] J. Kurien and P. P. Nayak. Back to the future for consistency-based trajectory tracking. In Proc. AAAI, pages 370–377, 2000.

[Mozetic, 1991] I. Mozetic. Hierarchical model-based diagnosis. Int. Journal of Man-Machine Studies, 35(3):329–362, 1991.

[Provan, 2001] G. Provan. Hierarchical model-based diagnosis. In Proc. DX, pages 167–174, 2001.

[Sachenbacher and Struss, 2004] M. Sachenbacher and P. Struss. Task-dependent qualitative domain abstraction. Artificial Intelligence, 162(1–2):121–143, 2004.

[Struss et al., 1996] P. Struss, A. Malik, and M. Sachenbacher. Qualitative modeling is the key to automated diagnosis. In Proc. IFAC96, 1996.

[Torasso and Torta, 2005] P. Torasso and G. Torta. Automatic abstraction of time-varying system models for model based diagnosis. LNAI, 3698:176–190, 2005.

[Torta and Torasso, 2003] G. Torta and P. Torasso. Automatic abstraction in component-based diagnosis driven by system observability. In Proc. IJCAI, pages 394–400, 2003.


Reliability and Diagnostics of Modular Systems: a New Probabilistic Approach

Michael Wachter, Rolf Haenni, Jacek Jonczy

University of Bern
Institute of Computer Science and Applied Mathematics
Neubruckstr. 10, CH-3012 Bern, Switzerland
Email: wachter,haenni,[email protected]
Tel. +41 31 631 8643 Fax +41 31 631 3260

Abstract

Reliability theory and model-based diagnostics are two different, although closely related fields. Both have their own well-developed techniques and computational models, but their connection is rarely pointed out. This paper starts from a common modular system description, whose underlying Boolean function is transformed into a compact graphical representation. The goal is to show how to use this technique for a probabilistic analysis of both the system's reliability and the diagnostic problem.

1 Introduction

Technical systems have the unfortunate property of occasionally malfunctioning. A system failure may be caused by either a single faulty component or a combination of simultaneously faulty components. A faulty component itself may result from one or several faulty sub-components, and so on. This raises two important questions. The first one concerns the overall reliability of the entire system, which usually decreases as the system complexity grows. The second question arises if the system (or some of its components) is observed to be malfunctioning. The problem then is to find possible diagnoses explaining the cause of the defect. These two problems, evaluating the system's reliability and finding diagnoses in case of failures, are obviously closely interconnected, but this perception is surprisingly not very common in the literature [Kohlas et al., 2001; Anrig and Kohlas, 2002]. This paper proposes a common formal approach for both problems.

Reliability Theory vs. Model-Based Diagnostics

Traditionally, reliability has been studied by engineers who were concerned with the capacity of a component or a system of such components to perform as designed. The resulting reliability theory [Bazovsky, 1961; Ravichandran, 1991; Rausand and Høyland, 1994] is, broadly defined, a collection of various techniques such as Failure Modes and Effects Analysis (FMEA), Hazard and Operability Analysis (HAZOP), Fault Tree Diagrams (FTD), Event Tree Analysis (ETA), Cause-Consequence Analysis (CCA), Reliability Block Diagrams (RBD), Management Oversight Risk Trees (MORT), and many more. Most of them start from a modular view and propose tree- or graph-based representations of the system. The classical areas of application are safety-critical technical and industrial systems such as airplanes or nuclear plants, but today similar techniques are also applied to investigate software systems, networks, or human-dependent administration or management systems. A related area of research is risk analysis [Andrews and Moss, 2002].

Diagnosing malfunctioning systems is an important research topic in the area of Artificial Intelligence. Most approaches are model-based, i.e. they start from a more or less concise description of the entire system (components, structure, functionality, ...). The level of granularity or detail of such models depends on both the availability of data about the system and the intended purpose of the model. Compared to reliability models such as FTDs or RBDs, models in the context of model-based diagnostics are usually much more accurate. They usually consist of a set of sentences in a formal language such as constraint or predicate logic [Reiter, 1987; de Kleer et al., 1992; Kohlas et al., 1998; Frohlich, 1998].

For a given model and some observations about the current system behavior, the goal of model-based diagnostics is to decide whether the system is subject to a fault or not, and in case of a fault to deduce which faults explain the observed behavior. Such an explanation is called a diagnosis. Recently, various authors have recognized the importance of recursively suppressing structural details by aggregating parts of the system into modules. Exploiting the modular structure of a system can significantly reduce the complexity of reasoning with respect to a non-modular approach [Autio and Reiter, 1998; Kohlas et al., 2001; Chittaro and Ranon, 2004].

Classification of Models

Independently of whether the primary focus is on reliability or diagnosis, the operation of a system can be considered from two distinct standpoints: we can look at various ways for a system to operate properly, or we can look at various ways for a system to fail. In general, working in the failure space is more common than working in the success space, but the two standpoints are usually complementary.

Another general classification is between qualitative and quantitative methods. The primary goal of a qualitative reliability analysis is to identify the minimal cut sets (combinations of faulty components) of the system. Their dual counterpart are the minimal path sets (combinations of functioning components). A quantitative analysis aims at measuring the system reliability by the probability that a system or part of a system will work. An alternative quantitative measure is the complementary rate of failure [Rausand and Høyland, 1994].
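The duality between minimal path sets and minimal cut sets can be illustrated with a brute-force Python sketch. The three-component system below (c1 in series with the parallel pair c2, c3) is a hypothetical example, not one taken from the paper, and the exhaustive search is only practical for very small systems.

```python
from itertools import combinations

COMPONENTS = ("c1", "c2", "c3")


def works(up):
    """Hypothetical structure function: c1 in series with the parallel pair c2, c3."""
    return "c1" in up and ("c2" in up or "c3" in up)


def minimal_sets(pred):
    """Smallest-first search for the minimal subsets satisfying a monotone predicate."""
    found = []
    for size in range(len(COMPONENTS) + 1):
        for subset in map(frozenset, combinations(COMPONENTS, size)):
            if pred(subset) and not any(m <= subset for m in found):
                found.append(subset)
    return found


# Path sets: components whose functioning is enough for the system to work.
path_sets = minimal_sets(works)
# Cut sets: components whose joint failure is enough for the system to fail.
cut_sets = minimal_sets(lambda down: not works(set(COMPONENTS) - down))

print(path_sets)  # two path sets: {c1, c2} and {c1, c3}
print(cut_sets)   # two cut sets: {c1} and {c2, c3}
```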

Most approaches to model-based diagnosis are qualitative. Their primary goal is to find minimal diagnoses, i.e. combinations of faulty components explaining the observed failure. Minimal diagnoses are usually derived from the set of minimal conflicts, their dual counterpart. Some authors have proposed a probabilistic analysis [Tawfik and Neufeld, 1998; Anrig, 2000b; Lucas, 2001], but this is still a relatively unexplored field. The goal of such Bayesian methods is to calculate conditional probabilities over some target variables conditioned on the observed values of other variables, e.g. the probability of a component being broken given that the system is observed to be malfunctioning.
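The conditional probability mentioned above can be made concrete with a small enumeration in Python. The sketch assumes independent component failures; the structure function and the failure probabilities are hypothetical values chosen for illustration, and the exhaustive enumeration is exponential in the number of components.

```python
from itertools import product

# Hypothetical independent failure probabilities.
P_FAIL = {"c1": 0.1, "c2": 0.2, "c3": 0.2}


def works(state):
    """Hypothetical system: c1 in series with the parallel pair c2, c3."""
    return state["c1"] and (state["c2"] or state["c3"])


def posterior_broken(target):
    """P(target is broken | system is observed to fail), by full enumeration."""
    names = list(P_FAIL)
    joint = evidence = 0.0
    for bits in product((True, False), repeat=len(names)):
        state = dict(zip(names, bits))
        p = 1.0
        for n in names:
            p *= (1 - P_FAIL[n]) if state[n] else P_FAIL[n]
        if not works(state):
            evidence += p
            if not state[target]:
                joint += p
    return joint / evidence


for name in P_FAIL:
    print(name, round(posterior_broken(name), 3))
```

In this toy system the series component c1 receives the highest posterior, since its failure alone is enough to bring the system down.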

Among quantitative methods, another important classification is between static and dynamic models. Due to the restricted lifetime of most components, the reliability of a system typically decreases in the course of time. By considering time-dependent probability functions such as failure distributions, dynamic models try to take this aspect into account. This improves the accuracy of the analysis, but also increases its complexity. Static models start from a common time interval, which reduces the time-dependent distributions to single probability values.

In this paper, we start from a probabilistic success model. It describes a system as a collection of interconnected components, which are possibly grouped into nested modules. Each component of the system is supposed to behave like an independent stochastic variable with a corresponding failure probability. Dynamic aspects can be handled similarly to the static version, as we will see.

Computational Techniques

The simplest form of modeling the successful behavior of a system (or module) is to consider components with two states of operation only: either the component is working properly or it is defective. Accordingly, we will refer to them as a component's success or failure state, respectively. Formally, each component is thus described by a Boolean variable, where 1 represents its success state and 0 its failure state. The overall state of operation of the entire system (the so-called target or top-level event) can then be described by a Boolean function $f$ over the product space $\{0, 1\}^r$, where $r$ is the number of components. Of course, if $r$ gets large, it may be impracticable to specify $f$ explicitly, which is why the use of sophisticated computational techniques to represent and manipulate Boolean functions is of great importance.

In reliability theory, the most common way of handling complex Boolean functions is to transform the original system description (FTD, RBD, etc.) into minimal cut sets or minimal path sets. From the perspective of propositional logic, this is essentially a representation of the underlying Boolean function by means of a minimal Disjunctive Normal Form (DNF). The same remark holds for the concepts of minimal conflicts and minimal diagnoses in the context of model-based diagnostics. Note that DNF representations are problematic, since the number of necessary terms may grow exponentially with the number $r$ of Boolean variables. To overcome this difficulty, some authors suggested the use of more compact representations such as Binary Decision Diagrams (BDD) [Krieger et al., 1993; Sinnamon and Andrews, 1996; Zang et al., 2003].

Another problem of general DNF representations is their lack of support for efficient probability computations, which are the core operation of most quantitative approaches. The most popular alternative to the rudimentary inclusion-exclusion principle is the transformation of the given DNF into a disjoint DNF with non-interleaving terms. In the literature, this transformation is known as the problem of generating disjoint Sum-of-Products (SOP) [Rai et al., 1995; Rauzy et al., 2003]. What most of the existing SOP algorithms have in common is the relative flatness of their results, if they are regarded as trees or directed graphs [Abraham, 1979; Heidtmann, 1989; Bertschy and Monney, 1996; Chatelet et al., 1999; Anrig, 2000a; Heidtmann, 2002]. More promising is thus the idea of using free or ordered BDDs, which are known for their compactness, but still allow probability computations in polynomial time with respect to their size [Fey and Drechsler, 2002].
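The contrast between inclusion-exclusion and a disjoint sum of products can be sketched in Python. The system, its minimal path sets and the success probabilities are hypothetical, and components are assumed independent. The disjoint form is written by hand for this tiny example rather than produced by one of the cited SOP algorithms.

```python
from itertools import combinations
from math import prod

# Hypothetical independent success probabilities.
P = {"c1": 0.9, "c2": 0.8, "c3": 0.8}
# Minimal path sets of the system "c1 AND (c2 OR c3)".
PATH_SETS = [{"c1", "c2"}, {"c1", "c3"}]


def inclusion_exclusion(terms):
    """P(at least one term holds): alternating sum over all non-empty subsets of terms."""
    total = 0.0
    for k in range(1, len(terms) + 1):
        for group in combinations(terms, k):
            total += (-1) ** (k + 1) * prod(P[c] for c in set().union(*group))
    return total


# Disjoint form: (c1 AND c2) OR (c1 AND NOT c2 AND c3).
# The two terms are mutually exclusive, so their probabilities simply add up.
disjoint = P["c1"] * P["c2"] + P["c1"] * (1 - P["c2"]) * P["c3"]

print(inclusion_exclusion(PATH_SETS), disjoint)
```

Inclusion-exclusion visits every non-empty subset of the terms, so its cost grows exponentially with the number of terms, whereas the disjoint form needs a single pass over its terms.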

A recent paper on Propositional Directed Acyclic Graphs (PDAG) offers a similar but more powerful solution for the SOP problem [Wachter and Haenni, 2006]. The key observation concerns the members of a certain PDAG sub-class, the ones that are at the same time decomposable and deterministic. They are provably more succinct than corresponding free or ordered BDDs, disjoint DNFs, or other similar forms, but they still allow probability computations in polynomial time (as well as satisfiability checking, clause entailment, model counting, etc.). According to [Wachter and Haenni, 2006], it seems that decomposable and deterministic PDAGs are the optimal representation of Boolean functions with regard to probability computations. This is the starting point and the main motivation for this paper. A short summary of the PDAG technique is given in Section 3.

Goals and Overview

The close relationship and duality between reliability theory and model-based diagnostics allows us to conceive methods that may be used to deal with problems in both areas. As was pointed out before, we will start from a modular description of a system’s success state and use the above-mentioned PDAG sub-class to represent the underlying Boolean function. This will allow us to perform the necessary probability computations in an efficient way for both reliability and diagnostic purposes. As we will see, the modular structure of the system guarantees the decomposability of the resulting PDAG, which in turn helps to keep the necessary computational work load within feasible limits.

The remainder of this paper is organized as follows. In Section 2, we will start with an introductory example. Section 3 introduces the PDAG representation of Boolean functions and summarizes the most important results. Section 4 proposes a formal model for modular systems. Section 5 explores the quantitative reliability analysis using PDAGs, which is then applied in Section 6 for finding the most plausible diagnoses.


2 Introductory Example

The most popular methods to study the behavior of modular systems are Reliability Block Diagrams (RBD) and Fault Tree Diagrams (FTD). They mainly differ in two points, namely that RBDs represent the system’s success state and are mostly constructed bottom-up, whereas FTDs describe the complementary failure state and are usually constructed top-down. Traditionally, fault trees have more often been used with fixed failure probabilities, while RBDs may have included time-varying distributions for the success, but this is not a conceptual limitation of either side.

RBDs are graphical representations of the components and modules (represented as blocks) of the system and of how they are arranged reliability-wise. This means that the success state of the system or module is represented in terms of the success states of its individual components. Note that this may differ from how the components are physically connected. FTDs are also a graphical design technique, but they display the failure state of a system (called the top-level event) in terms of the failure states of its components (called basic events) by means of a tree in which the nodes are logical gates connecting its branches. The root of the tree represents the top-level event, and the terminal nodes are the basic events. Note that FTDs can be converted to RBDs and vice versa.

To illustrate these concepts, consider a simple modular system describing the airworthiness of a double-engine aircraft. The system consists of four components s (steering), e1 (engine 1), e2 (engine 2), and f (fuel tank). The components e1 and e2 are both part of a module E (engine). The overall module for the entire system is denoted by A (aircraft). According to this modular structure, the aircraft is considered to be airworthy iff the components s and f as well as the module E are working. Similarly, E is working iff at least one of e1 and e2 is working. This simple system is depicted in Fig. 1 both as an RBD and an FTD. The corresponding minimal path sets are {s, e1, f} and {s, e2, f}, and the minimal cut sets are {¬s}, {¬e1, ¬e2}, and {¬f}.

Figure 1: The aircraft example as an RBD (left-hand side) and an FTD (right-hand side).

To turn our focus to the main topic of this paper, namely the description and manipulation of modular systems by efficient representations of Boolean functions, let the system components be represented by corresponding Boolean variables s, e1, e2, and f with possible values 1 (success) and 0 (failure). Furthermore, let A be the Boolean variable for the success state of the entire aircraft. The connection between A and its components is then expressed by the Boolean function f with A = f(s, f, e1, e2). The following table shows f under all possible configurations. Only three configurations lead to a success state (A = 1).

s    0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
e1   0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
e2   0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
f    0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
A    0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1

Throughout this paper, we will assume all components to work and fail independently of each other. Consequently, if

p(s) = 0.8, p(e1) = p(e2) = 0.9, p(f) = 0.7,

are the independent probabilities of the components’ success states, we can compute the aircraft’s reliability p(A), i.e. the probability that the entire system is working correctly, by the following sum of products:

p(A) = p(s)·p(¬e1)·p(e2)·p(f)
     + p(s)·p(e1)·p(¬e2)·p(f)
     + p(s)·p(e1)·p(e2)·p(f)
     = 0.0504 + 0.0504 + 0.4536 = 0.5544 .

Similarly, we could obtain the failure probability p(¬A) by summing over all failure configurations, but since failure and success configurations are complementary, we obtain the same result more easily by

p(¬A) = 1− p(A) = 0.4456 .

Of course, less effort would be required to obtain the same results from the minimal path sets or cut sets, just by making them disjoint. For example, adding ¬e1 to the second minimal path set makes them disjoint, and we obtain

p(A) = p(s)·p(e1)·p(f) + p(s)·p(¬e1)·p(e2)·p(f)
     = 0.504 + 0.0504 = 0.5544 .
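As a cross-check of the two computations above, the following minimal Python sketch recomputes p(A) once by enumerating all 16 configurations and once from the disjoint path sets. The function names (`structure`, `reliability_by_enumeration`, `reliability_by_disjoint_paths`) are illustrative choices of ours, not part of any library.

```python
from itertools import product

# Success probabilities from the example (components assumed independent).
P = {"s": 0.8, "e1": 0.9, "e2": 0.9, "f": 0.7}
NAMES = ["s", "e1", "e2", "f"]


def structure(s, e1, e2, f):
    """Structure function of the aircraft: A = s AND (e1 OR e2) AND f."""
    return s and (e1 or e2) and f


def reliability_by_enumeration(p):
    """Sum the probabilities of all configurations in which the system works."""
    total = 0.0
    for values in product([0, 1], repeat=len(NAMES)):
        if structure(*values):
            weight = 1.0
            for name, value in zip(NAMES, values):
                weight *= p[name] if value else 1.0 - p[name]
            total += weight
    return total


def reliability_by_disjoint_paths(p):
    """Use the disjoint path sets {s, e1, f} and {s, not e1, e2, f}."""
    first = p["s"] * p["e1"] * p["f"]
    second = p["s"] * (1.0 - p["e1"]) * p["e2"] * p["f"]
    return first + second


print(round(reliability_by_enumeration(P), 4))     # 0.5544
print(round(reliability_by_disjoint_paths(P), 4))  # 0.5544
```

The enumeration visits 2^r configurations, whereas the disjoint form needs only one product per path set, which is the motivation for the more compact representations discussed next.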

In the following section, we will introduce a more sophisticated technique to further speed up such probability computations with respect to Boolean functions.

3 Propositional DAGs

Consider a set V of r propositional variables and a Boolean function (BF) f : {0, 1}^r → {0, 1} [Clote and Kranakis, 1998]. Such a function f can also be viewed as the set of r-dimensional vectors x ∈ {0, 1}^r for which f evaluates to 1. This is the so-called satisfying set or set of models S_f = {x ∈ {0, 1}^r : f(x) = 1} of f, for which an efficient representation has to be found.

According to [Wachter and Haenni, 2006], a Propositional Directed Acyclic Graph (PDAG) is a rooted, directed acyclic graph in which each leaf node is represented by ○ and labeled with ⊤ (true), ⊥ (false), or x ∈ V. Each non-leaf node is represented by △ (logical and), ▽ (logical or), or ◇ (logical not). The set of all possible PDAGs of V is called a language and denoted by PDAG_V or simply PDAG. Fig. 2 depicts an example (the attached numbers will be commented on later).

Figure 2: A PDAG representing the odd parity function with respect to V = {a, b, c, d}.

Leaves labeled with ⊤ (⊥) represent the constant BF which evaluates to 1 (0) for all x ∈ {0, 1}^r. A leaf labeled with x ∈ V is interpreted as the assignment x = 1, i.e. it represents the BF which evaluates to 1 iff x = 1. The BF represented by a △-node is the one that evaluates to 1 iff the BFs of all its children evaluate to 1. Similarly, a ▽-node represents the BF that evaluates to 1 iff the BF of at least one child evaluates to 1. Finally, a ◇-node represents the complementary BF of its child, i.e. the one that evaluates to 1 iff the BF of its child evaluates to 0. The BF of an arbitrary ϕ ∈ PDAG will be denoted by fϕ. Two PDAGs ϕ, ψ ∈ PDAG are equivalent iff fϕ = fψ. This is denoted by ϕ ≡ ψ.

Our convention is to denote PDAGs by lower-case Greek letters such as ϕ, ψ, or the like. The set of variables included in a sub-PDAG α of ϕ is denoted by vars(α) ⊆ V. The number of edges of a PDAG ϕ is called its size and denoted by |ϕ|. The depth of ϕ, denoted by depth(ϕ), is the maximal length of a path from the root to a leaf. PDAGs can have a number of properties [Darwiche and Marquis, 2002; Wachter and Haenni, 2006], but in the context of this paper, only two of them are relevant:

• Decomposability: this property holds if the sets of variables of the children of each △-node (logical and) α in ϕ are pairwise disjoint (i.e. if β1, . . . , βl are the children of α, then vars(βi) ∩ vars(βj) = ∅ for all i ≠ j);

• Determinism: this property holds if the children of each ▽-node (logical or) α in ϕ are pairwise logically contradictory (i.e. if β1, . . . , βl are the children of α, then βi ∧ βj ≡ ⊥ for all i ≠ j).
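To make the two properties concrete, the following Python sketch checks them on small PDAGs. The tuple encoding (a leaf is a variable name or a Boolean constant, an inner node is `("and", ...)`, `("or", ...)` or `("not", child)`) and all function names are assumptions made for this illustration. Determinism is tested by brute force over the variables of each pair of children, which is exponential and only meant for tiny examples.

```python
from itertools import combinations, product

# Hypothetical encoding: a leaf is a variable name (str) or a constant (bool);
# an inner node is a tuple ("and", c1, ...), ("or", c1, ...) or ("not", c).


def variables(node):
    """Return vars(node), the set of variables occurring below node."""
    if isinstance(node, bool):
        return set()
    if isinstance(node, str):
        return {node}
    return set().union(*(variables(child) for child in node[1:]))


def evaluate(node, assignment):
    """Evaluate the Boolean function of node under a variable assignment."""
    if isinstance(node, bool):
        return node
    if isinstance(node, str):
        return assignment[node]
    op, children = node[0], node[1:]
    if op == "and":
        return all(evaluate(c, assignment) for c in children)
    if op == "or":
        return any(evaluate(c, assignment) for c in children)
    return not evaluate(children[0], assignment)


def inner_nodes(node):
    """Yield node and all inner nodes below it."""
    if isinstance(node, tuple):
        yield node
        for child in node[1:]:
            yield from inner_nodes(child)


def is_decomposable(root):
    """Children of every and-node must have pairwise disjoint variable sets."""
    return all(
        variables(a).isdisjoint(variables(b))
        for n in inner_nodes(root) if n[0] == "and"
        for a, b in combinations(n[1:], 2)
    )


def is_deterministic(root):
    """Children of every or-node must be pairwise contradictory (brute force)."""
    for n in inner_nodes(root):
        if n[0] != "or":
            continue
        for a, b in combinations(n[1:], 2):
            names = sorted(variables(a) | variables(b))
            for values in product([False, True], repeat=len(names)):
                env = dict(zip(names, values))
                if evaluate(a, env) and evaluate(b, env):
                    return False
    return True


xor = ("or", ("and", "a", ("not", "b")), ("and", ("not", "a"), "b"))
print(is_decomposable(xor), is_deterministic(xor))  # True True
print(is_deterministic(("or", "a", "b")))           # False
print(is_decomposable(("and", "a", ("not", "a"))))  # False
```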

A decomposable and deterministic PDAG is called a cd-PDAG. We will use cd-PDAG to refer to the corresponding language, a sub-language of PDAG. Note that the example shown in Fig. 2 is a cd-PDAG.

Other sub-languages are obtained from considering further properties: d-DNNF (decomposable and deterministic negation normal forms) is the sub-language of cd-PDAG satisfying simple-negation, FBDD (free BDDs) is the sub-language of d-DNNF satisfying decision and read-once, OBDD (ordered BDDs) is the sub-language of FBDD satisfying ordering, and d-DNF (disjoint DNFs) is the sub-language of d-DNNF satisfying flatness and simple-conjunction. For a more comprehensive overview and a detailed discussion we refer to [Darwiche and Marquis, 2002; Wachter and Haenni, 2006].

A language L1 is at least as succinct as another language L2, L1 ⪯ L2, if any sentence α2 ∈ L2 has an equivalent sentence α1 ∈ L1 whose size is polynomial in the size of α2. A language L1 is strictly more succinct than another language L2, L1 ≺ L2, iff L1 ⪯ L2 and L2 ⋠ L1. With respect to the above-mentioned languages, we have the following proven relationships [Wachter and Haenni, 2006]:

PDAG ≺ cd-PDAG ⪯ d-DNNF ≺ FBDD ≺ OBDD

≺ d-DNF.

It is still unknown whether cd-PDAG is strictly more succinct than d-DNNF or not. Examples of Boolean functions with a polynomial representation in cd-PDAG but an exponential representation in d-DNF are the parity functions (odd or even) or k-out-of-n relationships.

We will now see that decomposability and determinism together are sufficient for a language to offer efficient probability computations. Let p(x) denote the given marginal probability of a variable x ∈ V being true. If we assume the Boolean variables in V to be mutually independent, and if ϕ is a cd-PDAG, then the probability p(fϕ) = p(ϕ) of the Boolean function fϕ can be computed by the following recursive procedure (see Fig. 2 for an example):

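\[
p(\varphi) =
\begin{cases}
\prod_i p(\beta_i), & \text{if } \varphi \text{ is a } \triangle\text{-node (logical and) with children } \beta_i, \\
\sum_i p(\beta_i), & \text{if } \varphi \text{ is a } \triangledown\text{-node (logical or) with children } \beta_i, \\
1 - p(\beta), & \text{if } \varphi \text{ is a } \diamond\text{-node (logical not) with child } \beta, \\
p(x), & \text{if } \varphi \text{ is a leaf labeled with } x \in V, \\
1, & \text{if } \varphi \text{ is a leaf labeled with } \top, \\
0, & \text{if } \varphi \text{ is a leaf labeled with } \bot.
\end{cases}
\]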

In other words, decomposability and determinism allow us to replace △-nodes (logical and) and ▽-nodes (logical or) by products and sums, respectively. Within the language cd-PDAG, probability computations are thus possible in linear time (one arithmetic calculation at each node). In the light of the above succinctness relationships, it is thus clear that cd-PDAG, the sub-language of PDAG that guarantees decomposability and determinism, but nothing more, is the most suitable language for probability computations.
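The recursive procedure can be sketched in a few lines of Python. The tuple encoding of PDAGs, the helper `xor`, and the leaf probabilities below are assumptions chosen for illustration; the example builds a decomposable and deterministic PDAG for the odd parity function over {a, b, c, d}. Note that this sketch walks the graph as a tree, so shared sub-graphs would be revisited; caching results per node would be needed to obtain the linear-time behavior stated above.

```python
from math import prod

# Hypothetical encoding: a leaf is a variable name (str) or a constant (bool);
# an inner node is a tuple ("and", c1, ...), ("or", c1, ...) or ("not", c).


def probability(node, p):
    """Probability of a cd-PDAG under independent variable probabilities p.

    The result is only meaningful if the PDAG is decomposable and
    deterministic; this function does not check either property.
    """
    if isinstance(node, bool):
        return 1.0 if node else 0.0
    if isinstance(node, str):
        return p[node]
    op, children = node[0], node[1:]
    if op == "and":   # decomposability: the children are independent
        return prod(probability(c, p) for c in children)
    if op == "or":    # determinism: the children are mutually exclusive
        return sum(probability(c, p) for c in children)
    if op == "not":
        return 1.0 - probability(children[0], p)
    raise ValueError(f"unknown node type: {op!r}")


def xor(x, y):
    """Deterministic 'exactly one of x, y' over variable-disjoint x and y."""
    return ("or", ("and", x, ("not", y)), ("and", ("not", x), y))


# Odd parity of a, b, c, d as a cd-PDAG (illustrative leaf probabilities).
parity = xor(xor("a", "b"), xor("c", "d"))
P = {"a": 0.6, "b": 0.2, "c": 0.5, "d": 0.9}

print(round(probability(xor("a", "b"), P), 2))  # 0.56
print(round(probability(parity, P), 2))         # 0.5
```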

Another important advantage of cd-PDAGs is the flexibility to obtain their negations just by adding a ◇-node (logical not) on top (or by removing ◇-nodes from the top). In the context of this paper, this is particularly useful to easily switch from a success model to the complementary failure model and vice versa. Consequently, we may allow hybrid models, which partly describe the success space and partly the failure space of the system. The formal model of the following section is mainly oriented towards the success space, but with cd-PDAGs as the underlying machinery, this is no restriction at all.

4 Modular Systems

To benefit from the power of cd-PDAGs in the context of reliability theory or model-based diagnostics, let us further formalize the idea of a modular system. A (non-modular) system S = (C, f) consists of components C = {c1, . . . , cr}, r ≥ 1, and a structure function f. To represent a component’s state of operation, we will use ci not only to denote the component itself, but also for an associated Boolean variable. Suppose that ci = 1 if ci is working and ci = 0 if ci is faulty.

The structure function is a Boolean function f : {0, 1}^r → {0, 1}, which connects the components’ states of operation with the state of operation of the entire system, i.e. we have f(c1, . . . , cr) = 1 iff the system is working. In reliability theory, f is often restricted to be coherent or monotone (non-decreasing in each argument), but we refrain from doing so here. Note that 1−f is the complementary structure function of the system’s failure state.

Since explicitly specifying f becomes impracticable if r is large, we may do better by exploiting the modular structure of a system. We will thus define a modular system M = (C, M, T, L) to consist of components C = {c1, . . . , cr}, r ≥ 1, modules M = {M0, . . . , Ms}, s ≥ 0, an organizing tree T, and local structure functions L = {ℓ0, . . . , ℓs} [Kohlas, 1987].

The organizing tree T is supposed to be directed and rooted, and we suppose the elements of C to be the leaves, the elements of M the non-leaves, and M0 the root of the organizing tree. With N = C ∪ M we denote the complete set of nodes of T. The successor set succ(Mi) ⊆ N of a module Mi ∈ M contains all its direct descendants in T. Note that the tree structure ensures that each component and each module appears in at most one successor set.

The local structure function of each module Mi ∈ M is a Boolean function

ℓi : {0, 1}^|succ(Mi)| → {0, 1},

which describes the state of operation of the module in terms of its constituents succ(Mi). Note that the particular case M = {M0} implies ℓ0 to be the (non-local) structure function of the corresponding (non-modular) system S = (C, ℓ0). Conversely, every system S′ = (C, f) is equivalent to the modular system M′ = (C, {M0}, T, {f}), where T is the tree with root M0 and succ(M0) = C.

In general, any modular system M = (C, M, T, L) unambiguously defines the (non-local) structure function of a corresponding (non-modular) system S with the same components C. To show this connection, let us define the unfolded (non-local) structure function of a component or module n ∈ N by

\[
f(n) :=
\begin{cases}
c_i, & \text{if } n = c_i \in C, \\
\ell_i(f(n_1), \dots, f(n_t)), & \text{if } n = M_i \in M,
\end{cases}
\]

with succ(Mi) = {n1, . . . , nt}. Thus f(Mi) is the structure function of module Mi with respect to the restricted set of components appearing in its own subtree. The modular system M = (C, M, T, L) is then equivalent to the system S = (C, f(M0)).

Example

Consider the aircraft example from Section 2, but to make it more interesting, let it now consist of two modules S (steering) and G (gear). Module G is further decomposed into two sub-modules E (engine) and F (fuel). Module S has two steering components s1 and s2, module E has three components e1 (left engine), e2 (right engine), and e3 (rear engine), and module F has two components f1 (fuel tank 1) and f2 (fuel tank 2). The structure and behavior of the system are described by the RBD shown in Fig. 3, where 2/3 stands for a “2-out-of-3” relationship.

Figure 3: The RBD of the extended aircraft example.

In terms of the above formal setting, the airworthiness of the aircraft leads to the modular system M = (C, M, T, L) with C = {s1, s2, e1, e2, e3, f1, f2}, M = {A, S, G, E, F}, L = {ℓA, ℓS, ℓG, ℓE, ℓF}, and the organizing tree T shown in Fig. 4. The root node A represents the state of operation of the entire aircraft and can be thought of as the overall module that makes up the system.

Figure 4: The organizing tree in the aircraft example.

To be in accordance with the above RBD, the corresponding local structure functions are as follows (∧ and ∨ are the standard logical connectives):

ℓA(S, G) = S ∧ G,
ℓS(s1, s2) = s1 ∧ s2,
ℓG(E, F) = E ∧ F,
ℓE(e1, e2, e3) = (e1∧e2) ∨ (e1∧e3) ∨ (e2∧e3),
ℓF(f1, f2) = f1 ∨ f2.

By plugging the local structure functions recursively into each other, we obtain the following “global” structure function f(A) for the entire aircraft:

f(A) = s1 ∧ s2 ∧ [(e1∧e2) ∨ (e1∧e3) ∨ (e2∧e3)] ∧ (f1∨f2).
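The unfolding of local structure functions can be sketched as follows. The dictionaries `SUCC` and `LOCAL` and the function names are assumptions introduced for this illustration; they encode the organizing tree and the local structure functions of the extended aircraft example. The sketch then checks that the recursively unfolded function agrees with the global formula f(A) on all 2^7 component configurations.

```python
from itertools import product

# Organizing tree of the extended aircraft example: module -> successor set.
SUCC = {
    "A": ["S", "G"],
    "S": ["s1", "s2"],
    "G": ["E", "F"],
    "E": ["e1", "e2", "e3"],
    "F": ["f1", "f2"],
}

# Local structure functions, one per module, over its successors.
LOCAL = {
    "A": lambda S, G: S and G,
    "S": lambda s1, s2: s1 and s2,
    "G": lambda E, F: E and F,
    "E": lambda e1, e2, e3: (e1 and e2) or (e1 and e3) or (e2 and e3),
    "F": lambda f1, f2: f1 or f2,
}

COMPONENTS = ["s1", "s2", "e1", "e2", "e3", "f1", "f2"]


def unfold(n, state):
    """Unfolded structure function f(n) for a component or module n."""
    if n not in SUCC:                      # n is a component
        return state[n]
    return LOCAL[n](*(unfold(child, state) for child in SUCC[n]))


def global_structure(c):
    """The global formula f(A) obtained by plugging the local functions in."""
    two_of_three = ((c["e1"] and c["e2"]) or (c["e1"] and c["e3"])
                    or (c["e2"] and c["e3"]))
    return c["s1"] and c["s2"] and two_of_three and (c["f1"] or c["f2"])


agree = all(
    unfold("A", state) == global_structure(state)
    for values in product([False, True], repeat=len(COMPONENTS))
    for state in [dict(zip(COMPONENTS, values))]
)
print(agree)  # True
```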

In the remaining two sections, we will demonstrate how to use PDAGs to compute the reliability of a modular system and the posterior probabilities of possible diagnoses.

5 Reliability

Reliability is commonly defined as the probability of an item (component, module, system) to operate for a given amount of time without failure. Of course, the reliability of a modular system depends primarily on the reliability of its modules, which themselves depend on the reliability of their components. To make a reliability analysis of a modular system M = (C, M, T, L), suppose that the components fail independently of each other, and let the probability that a component ci ∈ C is properly working for a given amount of time be denoted by p(ci). This is the typical starting point of a static analysis. In a dynamic setting, p(ci) is simply replaced by a time-dependent probability p(ci, t). This can easily be incorporated into the probability computation of a cd-PDAG just by including t as an additional parameter.

To compute the reliability of the modular system M efficiently, let each local structure function ℓi of Mi be represented by a local cd-PDAG λi, where the successors of Mi are the leaves of λi. For computational techniques to generate cd-PDAGs of Boolean functions or to convert arbitrary PDAGs into equivalent cd-PDAGs, we refer to [Darwiche, 2002; Wachter and Haenni, 2006] and other forthcoming papers on this topic.

Recall that the tree structure ensures that each component and each module appears in at most one successor set. Starting from λ0, this allows us to recursively replace in each λi the leaves labeled with a module by the cd-PDAG of the corresponding local structure function, while both decomposability and determinism are preserved. Formally, let

\[
\varphi(n) :=
\begin{cases}
\text{leaf node labeled with } c_i, & \text{if } n = c_i, \\
\lambda_i(\varphi(n_1), \dots, \varphi(n_t)), & \text{if } n = M_i,
\end{cases}
\]

be the non-local cd-PDAG of each node n ∈ N. With λi(ϕ(n1), . . . , ϕ(nt)) we denote the cd-PDAG obtained from a module Mi with succ(Mi) = {n1, . . . , nt} by replacing each leaf nj of λi by ϕ(nj). In this way, we obtain a non-local cd-PDAG ϕ(Mi) for each module Mi ∈ M, i.e. ϕ(M0) is the cd-PDAG representation of M0’s structure function f(M0). Finally, to obtain the overall reliability of the entire modular system, all we have to do is to compute the probability p(M0) = p(ϕ(M0)) as explained in Section 3.

Example

Consider the extended aircraft example from the previous section and its description as a modular system. The cd-PDAGs λi of the local structure functions ℓi are shown in Fig. 5, and the global cd-PDAG ϕ(A) of the entire modular system is shown in Fig. 6. Note that ϕ(A) is obtained from λA by replacing the leaves labeled with S and G by ϕ(S) and ϕ(G), respectively. Similarly, ϕ(G) is obtained from λG by replacing the leaves labeled with E and F by ϕ(E) and ϕ(F), respectively, and so on.

The calculation of the aircraft’s reliability p(A) = 0.141 is also shown in Fig. 6 (up to 3 decimal places). It is based on the following success probabilities of the components: p(e1) = p(e2) = 0.9, p(e3) = 0.6, p(s1) = 0.7, p(s2) = 0.5, p(f1) = 0.3, and p(f2) = 0.2. These values are arguably not very realistic, but they are suitable for our illustrative purposes.
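The construction of ϕ(A) from the local cd-PDAGs and the resulting reliability can be reproduced with the following Python sketch. The tuple encoding and the names `LOCAL_PDAG`, `compose`, and `probability` are our own assumptions. In particular, the deterministic forms chosen here for the 2-out-of-3 module E and for the module F are one possible choice that yields the intermediate values 0.918 and 0.44; they are not necessarily the exact graphs drawn in the figures.

```python
from math import prod

# Local cd-PDAGs (hypothetical tuple encoding); leaves are component or
# module names, inner nodes are ("and", ...), ("or", ...) or ("not", child).
LOCAL_PDAG = {
    "A": ("and", "S", "G"),
    "S": ("and", "s1", "s2"),
    "G": ("and", "E", "F"),
    # 2-out-of-3 written with pairwise contradictory disjuncts (assumed form).
    "E": ("or",
          ("and", "e1", "e3"),
          ("and", ("not", "e1"), "e2", "e3"),
          ("and", "e1", "e2", ("not", "e3"))),
    # f1 OR f2 written as NOT(NOT f1 AND NOT f2), which avoids an or-node.
    "F": ("not", ("and", ("not", "f1"), ("not", "f2"))),
}

P = {"s1": 0.7, "s2": 0.5, "e1": 0.9, "e2": 0.9, "e3": 0.6,
     "f1": 0.3, "f2": 0.2}


def compose(node):
    """Build the non-local cd-PDAG by replacing module leaves recursively."""
    if isinstance(node, str):
        return compose(LOCAL_PDAG[node]) if node in LOCAL_PDAG else node
    return (node[0],) + tuple(compose(child) for child in node[1:])


def probability(node):
    """Probability of a cd-PDAG whose leaves are independent components."""
    if isinstance(node, str):
        return P[node]
    op, children = node[0], node[1:]
    if op == "and":
        return prod(probability(c) for c in children)
    if op == "or":
        return sum(probability(c) for c in children)
    return 1.0 - probability(children[0])   # "not"


for module in ["S", "E", "F", "G", "A"]:
    print(module, round(probability(compose(module)), 3))
# S 0.35, E 0.918, F 0.44, G 0.404, A 0.141
```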

Figure 5: The cd-PDAGs λi of the local structure functions.

Figure 6: The cd-PDAG ϕ(A) of the entire aircraft and the calculation of its probability.

6 Diagnostics

Let us now turn our attention to the problem of finding possible diagnoses if the system or parts of it are observed to be malfunctioning. The method we propose is based on the reliability analysis of the previous section. The goal is to compute conditional probabilities of a module or component being broken given some observations about the state of operation of one or several parts of the system (including the entire system). Computing such conditional probabilities is generally the spirit of a Bayesian analysis, which is now applied to the diagnostic problem.

As before, the discussion will be focused on the static case, where each component ci ∈ C has a fixed success probability p(ci). Furthermore, we suppose the Boolean function of a modular system M to be represented by the cd-PDAG ϕ(M0) of its top-level module M0, as explained in the previous section. Note that the cd-PDAG ϕ(n) of any module or component n ∈ N is contained in ϕ(M0) as a sub-graph (see example in Fig. 6). We will use ϕ(¬n) to denote the ◇-node (logical not) with child ϕ(n), thus representing the negation of ϕ(n).

Now let {n^1_obs, . . . , n^k_obs} ⊆ N be the observed modules or components, and let obs_i ∈ {n^i_obs, ¬n^i_obs} denote whether n^i_obs is working or not. Furthermore, suppose that q ∈ N is the component or module under investigation. In this paper, we only consider the case of single queries, but this is not a conceptual restriction. The conditional probability that q is working is calculated by

\[
p(q \mid obs_1, \dots, obs_k) = \frac{p(q, obs_1, \dots, obs_k)}{p(obs_1, \dots, obs_k)} .
\]

To see how to compute p(q|obs1, . . . , obsk) using the given cd-PDAG ϕ(M0), let us first restrict our discussion to the particular case of a single observation obs ∈ {nobs, ¬nobs}. Note that the cd-PDAGs ϕ(q) and ϕ(nobs) are both sub-PDAGs of ϕ(M0), while ϕ(¬nobs) is easily obtained from ϕ(nobs) by adding a ◇-node (logical not) on top. The posterior success probability of q can thus be written as

\[
p(q \mid obs) = \frac{p(q, obs)}{p(obs)} = \frac{p(\varphi(q) \wedge \varphi(obs))}{p(\varphi(obs))},
\]

where ϕ(q)∧ϕ(obs) represents the PDAG obtained from connecting ϕ(q) and ϕ(obs) with a △-node (logical and). Note that this new △-node is not necessarily decomposable, i.e. the entire structure is in general no longer a cd-PDAG and thus does not allow probability computations by the recursive procedure of Section 3. This is illustrated in Fig. 7 for the extended aircraft example with obs = ¬A and q = F.

Figure 7: The PDAG ϕ(¬A) ∧ ϕ(F).

To make the new △-node (logical and) decomposable, let us take a look at the organizing tree of the modular system. This reveals three possible cases:
(1) q and nobs are in distinct sub-trees,
(2) nobs is a sub-tree of q,
(3) q is a sub-tree of nobs.

The first case implies that ϕ(q) and ϕ(obs) share no common variables. This makes ϕ(q) ∧ ϕ(obs) decomposable and allows its probability to be computed by p(ϕ(q)∧ϕ(obs)) = p(ϕ(q))·p(ϕ(obs)). This implies p(q|obs) = p(ϕ(q)), which indicates that the query variable is not at all affected by the observation. For example, observing an empty fuel tank has no influence on steering.

The second and the third case are more challenging, since ϕ(q)∧ϕ(obs) is not decomposable. To solve this problem, let us quickly review an operation for PDAGs called conditioning [Wachter and Haenni, 2006]. Consider an arbitrary PDAG ϕ, a sub-PDAG ψ of ϕ, and τ ∈ {ψ, ¬ψ}. Conditioning ϕ on τ generates from ϕ a new PDAG ϕ|τ by replacing ψ by ⊤ (for τ = ψ) or ⊥ (for τ = ¬ψ), followed by simplifications such as ⊤∨α = ⊤, ⊤∧α = α, ⊥∨α = α, and ⊥∧α = ⊥. Note that conditioning preserves both decomposability and determinism. Furthermore, it ensures τ ∧ ϕ ≡ τ ∧ ϕ|τ and guarantees decomposability for τ ∧ ϕ|τ.

In the second case, we may thus generate the cd-PDAG ϕ(nobs)∧ϕ(q)|ϕ(nobs) and compute its probability to obtain the required numerator. The third case is analogous, except that q and nobs exchange their roles. An example of the third case is shown in Fig. 8 with obs = ¬A and q = F (as before).

In comparison with the PDAG of Fig. 7, the new cd-PDAG ϕ(F)∧ϕ(¬A)|ϕ(F) contains three new nodes, for which new probabilities need to be computed (indicated in bold). Finally, we get F’s conditional probability given ¬A by

\[
p(F \mid \neg A) = \frac{p(\varphi(F) \wedge \varphi(\neg A)|\varphi(F))}{p(\varphi(\neg A))} = \frac{0.299}{0.859} = 0.348 .
\]
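The posterior above can be cross-checked without any PDAG machinery by enumerating all 2^7 component configurations and applying the definition of conditional probability directly. The following Python sketch does this; the helper names `module_states` and `probability` are illustrative assumptions, and the brute-force enumeration is only feasible because the example is small.

```python
from itertools import product

# Component success probabilities of the extended aircraft example.
P = {"s1": 0.7, "s2": 0.5, "e1": 0.9, "e2": 0.9, "e3": 0.6,
     "f1": 0.3, "f2": 0.2}


def module_states(c):
    """Evaluate all modules from a component configuration c."""
    S = c["s1"] and c["s2"]
    E = (c["e1"] and c["e2"]) or (c["e1"] and c["e3"]) or (c["e2"] and c["e3"])
    F = c["f1"] or c["f2"]
    G = E and F
    return {"S": S, "E": E, "F": F, "G": G, "A": S and G}


def probability(event):
    """Total probability of all configurations whose module states
    satisfy the predicate event."""
    names = list(P)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        config = dict(zip(names, values))
        if event(module_states(config)):
            weight = 1.0
            for name in names:
                weight *= P[name] if config[name] else 1.0 - P[name]
            total += weight
    return total


p_not_A = probability(lambda m: not m["A"])
p_F_and_not_A = probability(lambda m: m["F"] and not m["A"])
posterior = p_F_and_not_A / p_not_A

print(round(p_not_A, 3), round(p_F_and_not_A, 3), round(posterior, 3))
# 0.859 0.299 0.348
```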

In the same way, it is possible to compute the conditional probabilities of all modules and components of the system. The ones with the highest values are then the most probable diagnoses. The case of multiple observations works similarly, although a more general form of conditioning is required. In any case, conditioning is always an efficient operation.

Figure 8: The resulting cd-PDAG ϕ(F) ∧ ϕ(¬A)|ϕ(F).

7 Conclusion

Reliability theory and diagnosis of modular systems are closely related worlds. One could say that reliability is the special case of diagnosis without observations. cd-PDAGs turn out to be an adequate computational technique for both, since decomposability and determinism guarantee that the computation of probabilities is polynomial with respect to the size of the cd-PDAG. Furthermore, observations can be handled easily, according to the structure of the organizing tree.

Acknowledgments

This research is supported by the Swiss National Science Foundation, Project No. PP002–102652. We greatly appreciate the helpful comments of Jürg Kohlas and the anonymous reviewers.

References

[Abraham, 1979] J. A. Abraham. An improved algorithm for network reliability. IEEE Transactions on Reliability, 28:58–61, 1979.

[Andrews and Moss, 2002] J. Andrews and B. Moss. Reliability and Risk Assessment (2nd Edition). American Society of Mechanical Engineers, 2002.

[Anrig and Kohlas, 2002] B. Anrig and J. Kohlas. Model-based reliability and diagnostic: A common framework for reliability and diagnostics. In M. Stumptner and F. Wotawa, editors, DX’02, 13th International Workshop on Principles of Diagnosis, pages 129–136, Semmering, Austria, 2002.

[Anrig, 2000a] B. Anrig. A generalization of the algorithm of Abraham. In M. Nikulin and N. Limnios, editors, MMR’2000: Second International Conference on Mathematical Methods in Reliability, pages 95–98, Bordeaux, France, 2000.

[Anrig, 2000b] B. Anrig. Probabilistic Model-Based Diagnostics. PhD thesis, University of Fribourg, Switzerland, 2000.

[Autio and Reiter, 1998] K. Autio and R. Reiter. Structural abstraction in model-based diagnosis. In H. Prade, editor, ECAI’98, 13th European Conference on Artificial Intelligence, pages 269–273, Brighton, U.K., 1998.

[Bazovsky, 1961] I. Bazovsky. Reliability Theory and Practice. Prentice Hall, Englewood Cliffs, USA, 1961.

[Bertschy and Monney, 1996] R. Bertschy and P. A. Monney. A generalization of the algorithm of Heidtmann to non-monotone formulas. Journal of Computational and Applied Mathematics, 76:55–76, 1996.

[Chatelet et al., 1999] E. Chatelet, Y. Dutuit, A. Rauzy, and T. Bouhoufani. An optimized procedure to generate sums of disjoint products. Reliability Engineering and System Safety, 65:289–294, 1999.

[Chittaro and Ranon, 2004] L. Chittaro and R. Ranon. Hierarchical model-based diagnosis based on structural abstraction. Artificial Intelligence, 155(1–2):147–182, 2004.

[Clote and Kranakis, 1998] P. Clote and E. Kranakis. Boolean Functions and Computation Models. Springer, 1998.

[Darwiche and Marquis, 2002] A. Darwiche and P. Marquis. A knowledge compilation map. Journal of Artificial Intelligence Research, 17:229–264, 2002.

[Darwiche, 2002] A. Darwiche. A compiler for deterministic, decomposable negation normal form. In AAAI’02, 18th National Conference on Artificial Intelligence, pages 627–634. AAAI Press, 2002.

[de Kleer et al., 1992] J. de Kleer, A. K. Mackworth, and R. Reiter. Characterizing diagnoses and systems. Artificial Intelligence, 56(2–3):197–222, 1992.

[Fey and Drechsler, 2002] G. Fey and R. Drechsler. Utilizing BDDs for disjoint SOP minimization. In MWSCAS’02, 45th IEEE International Midwest Symposium on Circuits and Systems, pages 306–309, Tulsa, USA, 2002.

[Frohlich, 1998] P. Frohlich. DRUM–II: Efficient Model–Based Diagnosis of Technical Systems. PhD thesis, University of Hannover, Germany, 1998.

[Heidtmann, 1989] K. D. Heidtmann. Smaller sums of disjoint products by subproducts inversion. IEEE Transactions on Reliability, 38(4):305–311, 1989.

[Heidtmann, 2002] K. D. Heidtmann. Statistical comparison of two sum-of-disjoint-product algorithms for reliability and safety evaluation. In SAFECOMP 2002, 21st International Conference on Computer Safety, Reliability and Security, pages 70–81. Springer, 2002.

[Kohlas et al., 1998] J. Kohlas, B. Anrig, R. Haenni, and P. A. Monney. Model-based diagnostics and probabilistic assumption-based reasoning. Artificial Intelligence, 104:71–106, 1998.

[Kohlas et al., 2001] J. Kohlas, B. Anrig, and R. Bissig. Reliability and diagnostic of modular systems. ORiON: The Journal of the Operations Research Society of South Africa, 16(1):47–62, 2001.

[Kohlas, 1987] J. Kohlas. Zuverlässigkeit und Verfügbarkeit. Teubner, 1987.

[Krieger et al., 1993] R. Krieger, B. Becker, and R. Sinkovic. A BDD-based algorithm for computation of exact fault detection probabilities. In FTCS’93, 23rd International Symposium on Fault-Tolerant Computing, pages 186–195, Toulouse, France, 1993. IEEE Computer Society.

[Lucas, 2001] P. J. F. Lucas. Bayesian model-based diagnosis. International Journal of Approximate Reasoning, 27(2):99–119, 2001.

[Rai et al., 1995] S. Rai, M. Veeraraghavan, and K. S. Trivedi. A survey on efficient computation of reliability using disjoint products approach. Networks, 25(3):147–163, 1995.

[Rausand and Høyland, 1994] M. Rausand and A. Høyland. System Reliability Theory: Models and Statistical Methods. John Wiley and Sons, New York, USA, 2nd edition, 1994.

[Rauzy et al., 2003] A. Rauzy, E. Chatelet, Y. Dutuit, and C. Bérenguer. A practical comparison of methods to assess sum-of-products. Reliability Engineering and System Safety, 79(1):33–42, 2003.

[Ravichandran, 1991] N. Ravichandran. Stochastic Methods in Reliability Theory. John Wiley and Sons, 1991.

[Reiter, 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32:57–95, 1987.

[Sinnamon and Andrews, 1996] R. M. Sinnamon and J. D. Andrews. Fault-tree analysis and binary decision diagrams. In IEEE Annual Reliability and Maintainability Symposium, pages 215–222, Las Vegas, USA, 1996.

[Tawfik and Neufeld, 1998] A. Y. Tawfik and E. Neufeld. Model-based diagnosis: a probabilistic extension. In A. Hunter and S. Parsons, editors, Applications of Uncertainty Formalisms, LNCS 1455, pages 379–396. Springer, London, U.K., 1998.

[Wachter and Haenni, 2006] M. Wachter and R. Haenni. Propositional DAGs: a new graph-based language for representing Boolean functions. In KR’06, 10th International Conference on Principles of Knowledge Representation and Reasoning, Lake District, U.K., 2006.

[Zang et al., 2003] X. Zang, D. Wang, H. Sun, and K. S. Trivedi. A BDD-based algorithm for analysis of multistate systems with multistate components. IEEE Transactions on Computers, 52(12):1608–1618, 2003.


Index of Authors

A

Alonso González, Carlos 31

B

Barta, César 3 Bayoudh, Mehdi 9 Benayadi, Nabil 17 Biswas, Gautam 69, 243

Biteus, Jonas 23 Boel, R. K. 125 Bouché, P. 17

Bregon, Anibal 31

C

Cannas, Barbara 39 Chaudron, M. 257

Combacau, Michel 251 Console, Luca 47

Cordier, Marie-Odile 55, 61, 117

D

Daigle, Matthew 69 Dousson, Christophe 77

E

El Mafkouk, R. 163 Escobet, T. 179

Esser, Michael 85

F

Fanni, Alexandra 39 Feldman, Alexander 93

Friedrich, Gerhard 101 Frisk, Eric 23

G

Gabard, J. 163 Van Gemund, Arjan 93

Le Goc, Marc 17 Grastien, Alban 61

Guerra, Pedro 109

H

Haenni, Rolf 273

I

Ingimundarson, Ari 109, 227

J

Jéron, Thierry 117 Jiroveanu, G. 125 Jonczy, Jacek 273

De Jonge, Femke 133

K

De Kleer, J. 141 Koutsoukos, Xenofon 69, 243

L

Lamperti, G. 147 Lunde, R. 155

M

Le Maigat, Pierre 77 Marchand, Hervé 117

Marcos, Andrés 5 Mayer, W 171

Messeguer, J. 179 Montisci, Augusto 39

Moro, Isaac 31

N

Nyberg, Mattias 23, 187, 211

O

Ocampo-Martínez, C. 227 Olive, Xavier 9

P

Peischl, Bernhard 195, 203 Pernestål, Anna 211 Picardi, Claude 47

Pietersma, Jurryt 93 Pinchinat, Sophie 117

Prieto, Oscar 31 Provan, G. 219

Pucel, Xavier 55 Puig, Vicenç 109, 179, 227

Pulido, Belarmino 31

Q

Quevedo, J. 179

R

Rass, Stefan 101 Ressencourt, H. 235

Rodríguez, Juan J. 31 Roos, Nico 133

Roychoudhury, Indranil 243

S

Shchekotykhin, Kostyantyn 101 De Schutter, B. 125

Simon, M. Aránzazu 31 Soldani, S. 251 Soomro, S. 203

Struss, Meter 85 Stumptner, Marcus 171

Su, Rong 257 Subias, Audine 251

T

Tessier, C. 163 Thesseider, D 47

Thomas, J. 235, 251 Torasso, Pietro 265

Tornil, S. 227 Torta, Gianluca 265

Travé-Massuyès, Louise 9, 55, 255

W

Wachter, Michael 273 Wahlberg, B. 211 Weber, Joerg 195

Witteveen, Cees 133 Wotawa, Franz 195, 203

Z

Zanella, Marina 147
