non-deterministic fault localization in communication ...sethi/papers/03/tr2003-03.pdf · using...

Non-deterministic Fault Localization in Communication Systems Using Belief Networks

Małgorzata Steinder, Adarshpal S. SethiComputer and Information Sciences Department

University of Delaware, Newark, DE

Abstract–This paper investigates non-deterministic fault di-agnosis in communication systems. We introduce a non-deterministic system model for multi-layer fault diagnosis,which incorporates both availability and performance prob-lems, and allows arbitrary relationships among system com-ponents to be represented. We propose a mapping of the lay-ered model into a belief network and investigate an applica-tion of Bayesian reasoning techniques to performing fault lo-calization using a belief network as a fault propagation model.Although it allows very accurate root cause determination, ex-act Bayesian reasoning is infeasible in real-life systems due toits exponential complexity. We introduce adaptations of twoBayesian reasoning techniques for polytrees, iterative beliefupdating and iterative most probable explanation calculation,as approximate fault localization schemes in belief networks ofarbitrary shape. Using end-to-end service failure diagnosis asa case study, we show through simulation that the approximateschemes allow almost optimally accurate fault localization tobe performed in polynomial time. The approximate techniquesare able to identify multiple simultaneous faults, incorporateboth positive and negative information into the reasoning pro-cess, and are resilient to noise in the observed symptoms. Bothtechniques are event-driven. They may be effectively used insystems with various types of dependencies among their com-ponents. We show that fault localization through iterative be-lief updating offers very high accuracy even if the fault propa-gation model contains inaccurate values of the probability dis-tribution. We conclude that Bayesian reasoning is a promisingtool in the area of fault localization.

1 INTRODUCTION

To improve the reliability and performance of a communica-tion system it is important to quickly and accurately detect anddiagnose its faults. Fault localization (also event correlationor root cause diagnosis) [34, 41, 80] isolates the most probableset of faults based on their external manifestations called symp-toms or alarms. Fault localization techniques available todayconcentrate on detecting and isolating faults related to networkconnectivity [20, 41, 77, 80]. The diagnosis focuses on lowerlayers of the protocol stack (typically the physical and data-link layers) [56, 80], and its major goal is to isolate faults re-lated to the availability of network resources, such as a brokencable, an inactive interface, etc. These techniques are usuallydeterministic, i.e., they assume that the dependencies amongthe system components are known with certainty [34, 80] orthat the information about the current system state is alwaysaccurate and complete [29, 41]. Most fault localization tech-niques reported in the literature are window-based [41, 42, 77],i.e., they analyze groups of symptoms that are collected over a

period of time, called time-window.

Fault localization techniques available today are inadequate foruse in the management of modern enterprise services such as e-commerce, telecommuting, virtual private networks [67], andapplication service provisioning [12], which impose new chal-lenges on the fault localization task. In addition to resourceavailability problems in lower layers of the protocol stack, thehigh-level and performance-related problems have to be diag-nosed. The root causes of higher-layer problems may be lo-cated in a different protocol layer or on a different networkhost, which is frequently not visible to the higher-layers. In-formation about the system’s internal structure provided to thefault management application may be incomplete or out-of-date, and it changes dynamically. Since performance-relatedproblems are more frequent than availability related ones, inlarge systems, it is likely that two or more unrelated perfor-mance problems may occur simultaneously. It is also morelikely for the observation of the system state to be ambiguousor inaccurate. In complex, multi-layer communication systemsit is difficult to determine symptom latencies in order to deter-mine the length of an appropriate correlation time-window. Itmay be also necessary to intertwine fault localization with test-ing.

Considering these difficulties, we argue that, in complex,multi-layer communication systems, fault localization tech-niques should aim at the following objectives.

1. The fault localization technique should besuitable for a va-riety of fault localization problems. In particular, it should beable to diagnose various types of faults across multiple proto-col layers, and it should allow various types of dependenciesamong the system components and various levels of uncer-tainty within the knowledge of the system structure and state.2. The technique should beable to isolate multiple simultane-ous faultseven if their symptoms overlap.3. The technique should be able to isolate faultsaccuratelyandit should havepolynomial complexity.4. The technique should beresilient to lost or spurious alarms.5. The technique should beevent-drivenrather than window-based. Event-driven fault localization techniques maintain thesystem state that is updated after every symptom observation.Window-based techniques group symptoms observed over aperiod of time and analyze them collectively at the end of thetime-window. Event-driven techniques allow a symptom to beanalyzed independently of other symptoms, which may hap-pen as soon as the symptom is received by the fault localiza-tion process. Therefore, they utilize time more efficiently. Inaddition, they allow fault localization to be intertwined with

1

testing procedures, which, with event-driven techniques, maybe designed based on the analysis of the previously observedsymptoms.6. The technique shouldfacilitate other fault managementtaskssuch as identifying system components affected by faults.In non-deterministic systems, a measure of confidence that acomponent is indeed malfunctioning should be provided.

1.1 Contributions of this paper

This paper investigates an application of Bayesian reasoningusing belief networks [60] to non-deterministic fault diagnosisin complex communication systems. We address both buildinga non-deterministic system model and diagnosing faults in thepresence of inaccurate information about the system state. Thepaper makes the following contributions to the field of faultmanagement:

• It advances the state-of-the-art in fault management by intro-ducing the requirement of (1) applicability to various fault-localization problems, (2) event-driven diagnosis, and (3) fa-cilitating fault-management tasks other than fault localiza-tion. The paper also emphasizes and justifies the importanceof the technique’s ability (1) to isolate multiple simultaneousfaults, (2) to deal with positive, lost, and spurious symptoms,and (3) to be accurate even if the system model is not.

• It introduces a dependency-graph model for multi-layer faultdiagnosis, which incorporates both availability and perfor-mance problems, and allows arbitrary relationships amongsystem components to be represented. The paper also pro-poses and justifies several refinements of the model that im-prove the model’s feasibility yet preserve its applicability toreal-life communication systems.

• It proposes the mapping of the layered dependency graphinto a belief network for both the general and refined systemmodels, and maps of the fault localization problem into theproblem of computing queries in belief networks.

• It applies two Bayesian inference algorithms that calculatebelief-updating and most-probable-explanation queries insingly-connected belief networks [60] to perform fault local-ization in belief networks with loops. It proposes a heuristicsthat applies the belief-updating algorithm [60] to performevent-driven diagnosis in non-polytree belief networks and,based on the results of belief updating, calculates the expla-nation hypothesis. To apply the most-probable-explanationalgorithm [60] the paper introduces an approximation thatreduces the algorithm’s complexity to polynomial.

• It extends the system model to incorporate uncertainty in-volved in observations of system state resulting from lost andspurious symptoms.

• It refines the dependency model to include the representationof end-to-end service dependencies.

• It evaluates and compares the proposed algorithms using theproblem of end-to-end service failure diagnosis as a casestudy. It also analyzes the impact of including positive, lost,and spurious symptoms into the analysis.

• It shows through simulation that the proposed techniques al-low fault localization to be accurate even if the probabilityestimation in the fault propagation model is inaccurate.

• It proposes the application of other canonical belief network

models to solve fault management problems for which thenoisy-OR model is inadequate.

1.2 Paper organization

In Section 2, we introduce a dependency graph model formulti-layer fault diagnosis. We discuss the difficulties of us-ing such a general model and propose several refinements thatimprove the model’s feasibility. We also discuss methods thatallow the system model to be built automatically.

In Section 3, we define belief networks concepts used in thispaper.

Section 4 proposes the mapping of the fault localization prob-lem into the problem of computing queries in belief networks.

In Section 5, we discuss the application of known algorithmsfor performing Bayesian reasoning in belief networks to faultlocalization. They include the bucket elimination frame-work [17] and two algorithms for calculating queries in singly-connected belief networks (polytrees): (1) belief updating and(2) most-probable explanation [60]. Then, we design andpresent the approximate polynomial fault localization tech-niques which utilize these algorithms.

Section 6 extends the system model to incorporate uncertaintyinvolved in observations of system state resulting from lost andspurious symptoms.

In Section 7, the algorithms proposed in Section 5 are evalu-ated through simulation using the problem of end-to-end ser-vice failure diagnosis as a case study. We first extend thesystem-model template introduced in Section 2 with a micro-model of an end-to-end service. Then, we describe the designof simulation scenarios used in the study. We compare the per-formance and accuracy of algorithms proposed in Section 5,and investigate whether through increasing the system instru-mentation and taking observation noise into account the robust-ness of non-deterministic techniques may indeed be improved.We also investigate the impact of probability estimation on theaccuracy of the proposed fault localization techniques.

In Section 8, we discuss the examples of fault managementproblems for which the noisy-OR model is inadequate, andpropose the application of other canonical belief network mod-els for these purposes.

Section 9 compares the approach proposed in this paper againstthe related work, focusing on the fault localization objectivesintroduced in this section.

2 LAYERED MODEL FORALARM CORRELATION

For the purpose of fault diagnosis, communication systemsare frequently modeled in a layered fashion imitating the lay-ered architecture of the modeled system [27, 51, 80]. Thisapproach provides a natural abstraction of the modeled sys-tem’s entities, reusability of the model’s modules, and abilityto divide the fault management task into separate, simpler sub-tasks. The main purpose of the model is to represent infor-mation about events, i.e., state changes of a communicationsystem’s entities, and their impact on the state of other enti-ties. The ability of a fault in one entity to change the state of

2

other entities, referred to as fault propagation, is an inherentfeature of communicating systems. Because of fault propaga-tion, the effects of abnormal operation of functions or servicesprovided by lower layers may be observed in higher layers.While propagating up the protocol stack, the failures changetheir semantics thereby losing information important for theirlocalization. For example, a failure at the data link layer tosuccessfully transmit a packet across a link may be observedin a higher layer as an inability toping the IP host to whichpackets are transmitted using the failed link. In order to findexplanations of higher-layer problems, it is useful to create afault propagation model. Fault management systems modelfault propagation by representing either causal relationshipsbetween events [11, 29, 80] or dependencies between commu-nication system entities [27, 38, 41, 66].

2.1 Layered model template

In the layered fault model, the definition of entity dependen-cies is based on real-life relationships between layers on a sin-gle host and between network nodes communicating withina single protocol layer. The fault model components may begenerally divided intoservices, protocols, andfunctions[27].A service offered by protocol layerL between nodesa andb (ServiceL(a,b)) is implemented in terms of layerL func-tions on hostsa andb (Network FunctionsL(a) andNetworkFunctionsL(b)), and layerL protocols through which hostsaandb communicate. The layerL protocols running betweenhostsa andb use layerL−1 functions on hostsa andb, andservices that layerL−1 offers between hostsa andb. LayerLfunctions on nodea depend on layerL−1 functions on nodea. The recursive dependencies between services, protocols andfunctions constitute a dependency graph as described in [27]. Itis clear that the observed system disorder may originate fromfailures of network services and functions, as well as proto-cols. However, network management only provides a means todetect and diagnose service- and function-related problems; itdoes not facilitate diagnosing protocol-related failures, therebyenforcing an assumption that the protocols are implementedcorrectly. In fact, detecting incorrect protocol implementationsis a subject of a separate field of study: protocol interoperabil-ity and conformance testing [22, 45]. For this reason, in thispaper, we find it justified to assume that protocols cannot con-tribute explanations to network problems, and therefore theircorresponding nodes may be removed from the dependencygraph. Figure 1 shows the resultant general dependency graphfor a layered network, in whichServiceL(a,b)directly dependsonServiceL−1(a,b).

The general dependency graph template obtained from ser-vices, protocols and functions in different layers provides amacro-view of the relationships that exist in the system. Itmay be argued that fault localization should be performed start-ing from a macro-view to select a potential spot of the prob-lem, and then it should focus on the micro-view of the cho-sen spot [58]. To incorporate the micro-view of the relation-ships within particular model components, the layered modelshould be further refined to include possibly complex relation-ships within services, and within functions in the same layer.The service or function micro-model is composed of internal

ServiceL−1(a,b)

ServiceL(a,b)

ServiceL+1(a,b)

Layer L+1Network FunctionsL+1(a)

Layer LNetwork FunctionsL(a)

Layer L−1Network FunctionsL−1(a)

Layer L+1Network FunctionsL+1(b)

Layer LNetwork FunctionsL(b)

Layer L−1Network FunctionsL−1(b)

Figure 1: Layered network dependency modelfunctions and services used to provide the modeled service orfunction. Thus,Network FunctionsL(a) should be representedas a graph of multiple layer L sub-functions on nodea im-plementingNetwork FunctionsL(a). Similarly, ServiceL(a,b)could be extended into a subgraph containing multiple layerLsub-services used to provideServiceL(a,b). The micro-modelsfor separate services and functions may overlap, thereby intro-ducing additional complexity into the dependency graph.

Fault localization techniques available today aim at diagnos-ing availability-related problems only [6, 40, 56, 78]. This pa-per extends the methodology of root-cause diagnosis to includeperformance related problems. With every dependency graphnode we associate multiple failure modesF1, . . . , Fk, whichrepresent availability and performance problems pertaining tothe service or function represented by the dependency graphnode [72]. In real-life systems, the following conditions aretypically monitored for and considered a service/function fail-ure:

• F1–service/function ceases to exist (e.g., the cable connec-tion is broken),

• F2–service/function introduces unacceptable delay (e.g.,one of the host-to-host links in network layer is congested),

• F3–service/function produces erroneous output (e.g., bit er-rors are introduced in a serial link between routers),

• F4–service/function occasionally does not produce output(e.g., packets are lost due to buffer overflow).

From the perspective ofServiceL(a,b) (see Figure 1), fail-ure modesF1, . . . , Fk of Network FunctionL(a), NetworkFunctionL(b), and ServiceL−1(a,b) are considered faults,while failure modesF1, . . . , Fk of ServiceL(a,b) are consid-ered symptoms. Similarly,Network FunctionL(a) considersits failure modes as symptoms and it considers failure modesof Network FunctionL−1(a) as faults. The micro-model ofServiceL(a,b) or Network FunctionL(a) defines which failuremode occurs inServiceL(a,b) or Network FunctionL(a) whena particular failure mode occurs in a service or function onwhich ServiceL(a,b) or Network FunctionL(a) depends. Tocreate a micro-model ofNetwork FunctionL(a), the knowl-edge of its definition and implementation is required. Themicro-model ofServiceL(a,b) is built based on the knowledgeof the network protocol used to provideServiceL(a,b). Forexample, knowing that layerL protocol implements an errordetection mechanism, one can predict that erroneous outputproduced byServiceL−1(a,b) (condition F3) results in data

3

loss in ServiceL(a,b) (condition F4). When layerL doesnot implement an error detection mechanism, conditionF3 inServiceL−1(a,b) results in conditionF3 in ServiceL(a,b). Un-fortunately, micro-models for specific services or functions aredifficult to generalize. Clearly, however, the micro-model of aservice or function, which may be complex, may make differ-ent failure modes of the service or function interdependent. Forexample, a function or service that experiences failure modeF1 may not experience any of the performance-related failuresF2, F3, andF4. Also, in many network services, unaccept-able delay (conditionF2) frequently coincides with high lossrate (conditionF4). These dependencies are indirect and resultfrom a common cause. For example, a broken cable causesconditionF1 but makes it impossible for conditionsF2, F3,andF4 to occur. Also, overflowing buffers cause both condi-tionsF2 andF4. Since failure modes are abstract notions, usu-ally there are no direct relationships between different failuremodes associated with the same function or service.

2.2 Non-determinism and its representation in the layeredmodel

Various techniques have been proposed to make the processof fault localization accurate and efficient [29, 39, 56, 80]. Acommon feature of these approaches is that their fault model isdeterministic, i.e., a dependency link froma to b implies thatif a fails, thenb also fails. The deterministic model is typi-cally sufficient to represent faults in lower layers of the proto-col stack related to the availability of services offered by theselayers. However, these fault localization techniques are ratherdifficult to apply when faults are Byzantine [61], e.g., relatedto service performance. In the transport and application lay-ers, frequent reconfigurations of service dependencies make itimpossible to keep such a deterministic model up-to-date.

Uncertainty about dependencies between communication sys-tem entities is represented by assigning probabilities to thelinks and/or nodes in the dependency or causality graph [41,42]. In the dependency graph, a weight assigned to a depen-dency link represents the probability that the node at the tail ofthe edge fails given the node at the head of the edge fails [41].Contrary to other publications on this subject [41], in this pa-per, the dependency graph nodes have multiple failure modes.At any time, a service or function corresponding to a depen-dency graph node may experience none, one, or more failuremodes it is associated with. Let us denote byFX the set of allfailure modes associated with nodeX. Clearly, nodeX is inone of2|FX | distinct states at any time. In state0≤s<2|FX |, aservice or function corresponding to nodeX is in failure modeFi∈FX if the i-th position in the binary representation ofscontains 1; otherwise, the service or function corresponding tonodeX is not in failure modeFi. Let us denote bySX theset of all possible states ofX. Let Y1, . . ., Yn be the set ofdependency graph nodes such that for everyi = 1 . . . n, Xdepends onYi. In general, uncertainty involved in the relation-ship betweenX andY1, . . ., Yn is modeled by labelingX withan (n+1)-dimensional probability matrixPX of size |SX | ×|SY1 | × . . . × |SYn

|, wherePX(sX , sY1 , . . ., sYn) = P{X is

in statesX | Y1 is in statesY1 ∧ . . . ∧ Yn is in statesYn}, and

eachsZ ∈ SZ .

The exponential size of probability matricesPX prohibits thecreation of the model in practical situations. In fact, the rep-resentation of conditional probability distribution with matri-ces of this complexity is not necessary in real-life systems.The following independence assumptions may be made, whichsimplify the model and thereby improve its feasibility.

1. Failure modes are not directly dependentRecall from Section 2.1 that failure modes within a function orservice may be interdependent. These dependencies are indi-rect; they result from the fact that two different failure typesmay occur as a result of the same underlying problem. Usuallythere are no direct dependencies between failure modes withina function or service. Therefore, given the known states ofnodesY1, . . ., Yn, the failure modes of nodeX are independentof each other. Thus, instead of a single(n+1)-dimensionalprobability matrixPX of size |SX | × |SY1 | × . . . × |SYn

|,nodeX may be labeled with an(n+1)-dimensional proba-bility matrix P1

X of size |FX | × |SY1 | × . . . × |SYn|, where

P1X(FX,i, sY1 , . . . , sYn) = P{X is in failure modeFi | Y1 is in

statesY1 ∧ . . . ∧ Yn is in statesYn}. WhenSX

(i)={sX ∈ SX

| if nodeX is in statesX it is in failure modeFi}, the followingrelationship holds:P1

X(FX,i, sY1 , . . . , sYn) =∑

sX∈SX(i)

PX(sX , sY1 , . . . , sYn)

2. Contributions by neighboring nodes are independentFurther refinement of the model may be achieved by makingthe assumption that services or functions independently con-tribute to the failure ofX. In other words, given failures ofYi and Yj that may cause a failure ofX, whether the fail-ure of Yi actually causes the failure ofX is independent ofwhether the failureYj causes the failure ofX. This is a com-mon assumption in fault localization techniques proposed inthe literature [11, 25, 41, 42]. Assuming the independence ofcontributions, uncertainty may be modeled by assigning prob-ability matrices to the dependency graph edges rather than toits nodes. Thus, the(n+1)-dimensional probability matrix ofsize|SX | × |SY1 | × . . . × |SYn | associated with nodeX is re-placed byn two-dimensional matricesPYi,X of sizes|SX | ×|SYi

| associated with edgesX→Y1, . . ., X→Yn, respectively,wherePYi,X(sX , sYi

) = P{X is in statesX | Yi is in statesYi}. The application of both independence properties 1 and

2 reduces the probability matrix associated with a dependencygraph edgeX→Yi to a two-dimensional matrixP2

Yi,Xof sizes

|FX | × |SYi |, whereP2Yi,X

(FX,i, sYi) = P{X is in failuremodeFX,i | Yi is in statesYi

} =∑

sX∈SX(i) PYi,X(sX , sYi

).3. Service or function failure modes independently contributeto a failure of a dependent service or functionFinally, we observe that in many real-life situations, differentfailure modes of service or function associated with nodeYi

independently affect the dependent service or function associ-ated with nodeX. In other words, if failuresFYi,j andFYi,k

of service of function represented by nodeYi may cause fail-ure FX,l, then whetherFYi,j causesFX,l is independent ofwhetherFYi,k causesFX,l. In general, such a refinement isreasonable and consistent with other attempts to model faultpropagation between components [42]. Using this propertywe can model uncertainty by assigning to nodeX an (n+1)-dimensional matrixP3

X of sizeSX × |FYi| × . . . × |FYn

|,

4

whereP3X(sX , FY1,i, . . . , FYn,j) = P{X is in statesX | Y1 is

in failure modeFY1,i ∧ . . . ∧ Yn is in failure modeFYn,j}.

Clearly, the application of observations 1, 2, and 3 al-lows the probability matrices to be reduced to polynomialsize. The weight assigned to dependency linkX→Yk isa two-dimensional matrix|FX | × |FYk

|, PYk,X , such thatPYk,X(FX,j , FYk,i) = P{ service or functionX is in failuremodeFX,j | service or functionYk is in failure modeFYk,i},whereFX,j ∈ FX andFYk,i ∈ FYk

.

2.3 Obtaining the dependency graph

The layered fault propagation model records two types of de-pendencies between services and functions in neighboring pro-tocol layers:static anddynamicdependencies. Static depen-dencies result from, e.g., standardized definition of functionsprovided by different layers of the protocol stack, or fromstatic network topologies. Dynamic dependencies result from,e.g., run-time addition and deletion of services (such as estab-lishment and termination of TCP sessions), dynamic routing,etc. Clearly, it is easier to obtain static dependencies than dy-namic ones, since they do not change during system run-timeor change only as a result of human intervention; Therefore,the model recording static dependencies may be built or modi-fied manually. However, since the manual process may becomeintricate and error-prone in large and complex systems, auto-mated techniques of obtaining static dependencies have beeninvestigated. The network topology may be obtained auto-matically through various network topology detection mech-anisms, which are built into some commercially available net-work management systems [76]. Other management systemsimplement proprietary techniques that allow the discovery ofhardware configuration changes such as addition or removal ofa network adapter or host [24]. IETF has recently recognized aneed for a standardized means of representing the physical net-work connections by proposing thePhysical TopologyMIB [5], which can be used to obtain topology information ifit is implemented in the managed domain.

Obtaining dynamic dependencies is significantly more difficultsince the dependency model has to be continuously updatedwhile the modeled system is running. In spite of that, manytools exist that facilitate the process of model building. Thesetools are usually specific to particular network services or func-tions. For example, all active TCP connections on a host maybe retrieved using thenetstatapplication [75]. A current routebetween a host and any other host may be obtained using pro-gramtraceroute[75].

Network management protocols such as SNMP [9] provide ameans to determine dependencies established using configura-tion or real-time routing protocols. For example, the manage-ment system may obtain the topology, which was dynamicallyestablished in the data-link layer by the Spanning Tree Pro-tocol [61] using the data contained indot1dBase Group ofBridge MIB [19]. Updates of the spanning tree may be trig-gered bynewRoot andtopologyChange traps [19]. In thenetwork layer of the Internet, current routes may be calculatedfrom ipRoutingTable of TCP/IP MIB-II [49]. In sourcerouting protocols (e.g., Source-Directed Relay of the military

protocol MIL-STD 188-220 [21] and Dynamic Source Rout-ing [35] proposed for wireless mobile networks), the currentroute between two hosts is embedded in the header of everytransmitted packet.

Other techniques of obtaining network topology have been alsoinvestigated. To monitor hierarchical network topology, No-vaes [55] uses IP multicast. Siamwalla et al. [69] propose sev-eral heuristics that exploit SNMP, DNS, ping, and traceroutefacilities to discover the network level topology of the Inter-net on both intra-domain and backbone levels. Govindan etal. [28] infer the Internet map using hop-limited probes. Reddyet al. [63] investigate a distributed topology discovery tech-nique for the Internet. Breitbart et al. [7] present an algorithmthat obtains both network and data-link layer topology usingTCP/IP MIB-II [49] andBridge MIB [19]. The algorithm ofLowekamp et al. [48] discovers the topology of a large Ether-net network allowing incomplete data inBridge MIBtables.

Several techniques have been proposed that are tailored towarddiscovering relationships among services and software com-ponents. An automated technique of detecting static depen-dencies within software components on a single machine isproposed in [37]. Ramanathan et al. [62] propose techniquesthat obtain relationships among services within ISP systems.Brown et al. [8] and Bagchi et al. [3] describe an active ap-proach based on controlled system perturbations designed fordiscovering dependencies among components of a typical e-commerce application. Ensel [23] applies neural networks toobtain relationships among the defined set of system compo-nents based on measurable values describing the components’activities and performance (e.g., CPU usage, bandwidth con-sumption, etc.). Hasselmeyer [30] argues that the dependenciesamong distributed cooperating components should be main-tained and published by services themselves, and proposes aschema that allows these dependencies to be obtained.

Despite all the methods cited in this section, it has to be ob-served that obtaining dependency information in an automaticfashion is still an open research problem, whose solution is be-yond the scope of this paper. The amount of effort currentlyinvested in solving it, however, lets us anticipate that manynew, more flexible, and more comprehensive techniques willbe proposed in the near future.

3 BELIEF NETWORK CONCEPTS

We now present the basic concepts of graph and belief networktheory used in the next sections of this paper.

A directed graphis a pairG=(V,E), whereV is a set of nodesandE={(Vi, Vj) | Vi, Vj∈V } is a set of edges. We will denoteby N the number of nodes in graphG, i.e., N=|V |. If edge(Vi, Vj) belongs to graphG, we say thatVi is aparentof Vj ,andVj is a child of Vi. If both (Vi, Vj) and (Vj , Vi) belongto E we say thatG contains an undirected edge betweenVi

andVj ; we also say thatVi andVj areneighbors. A directedacyclic graph(DAG) is a directed graph with no directed cy-cles. Anundirected graphis a graph that contains only undi-rected edges.

An ordered graphis a pair(G, o), whereG is an undirected

5

graph ando=V1, V2, . . . , VN is an ordering of the nodes. Inthe ordered graph, the number of neighbors of nodeVi thatprecede it in the ordering is called thewidth of nodeVi. Amodified ordered graph(GM , o) of an ordered graph(G, o) iscreated as follows: (1) the nodes of graphG are processed ac-cording to orderingo from last to first; (2) while processingnodeVi, all its neighbors that precede it in the order are con-nected to one another using undirected links. Thewidth of anordered graph(G, o), w∗(o), is the maximum node width inthe modified ordered graph(GM , o). Theinduced width of thedirected graphG, w∗, is the minimum width of(G, o) com-puted over all orderingso [17].

Themoral graphof a directed graphG is obtained by introduc-ing additional undirected edges between any two nodes with acommon child and then converting all directed edges into undi-rected ones [14].

A belief network[17, 60] is a directed acyclic graph (DAG),in which each node represents a random variable over a mul-tivalued domain. We will use terms “node” and “randomvariable” interchangeably, and denote them byVi. The setof all nodes is denoted byV . The domain of random vari-able Vi will be represented by symbolDi. The set of di-rected edgesE denotes an existence of causal relationshipsbetween the variables and the strengths of these influencesare specified by conditional probabilities. Formally, a beliefnetwork is a pair(G, P ), whereG is a DAG, P={Pi}, andPi is the conditional probability matrix associated with a ran-dom variableVi. Let Par(Vi)={Vi1 , . . . , Vin

} be the set ofall parents ofVi. Pi is a (|Par(Vi)|+1)-dimensional matrixof size |Di|×|Di1 |×. . .×|Din |, wherePi(vi, vi1 , . . . , vin)==P (Vi=vi | Vi1=vi1 , . . . , Vin

=vin). We will denote by

A={V1=v1,. . . ,Vn=vn} an assignment of values to variablesin set V where eachvj∈Dj . We will use vAj to denotethe value of variableVj∈V in alignmentA. Given a subsetof random variablesUk={Vk1 , . . . , Vkm}⊆V , we will denoteby UA

k ={Vk1=vAk1, . . . , Vkm

=vAkm} an assignment of values to

variables in setUk that is consistent with assignmentA. Anevidence sete is an assignmentUA

o , whereUo⊆V is a set ofvariables whose values are known, and for eachVoj

∈Uo, vAojis

its observed value.

Belief networks are used to make four basic queries given ev-idence sete: belief assessment, most probable explanation,maximum a posteriori hypothesis, and maximum expected util-ity [17]. The first two queries are of particular interest inthe presented research. Thebelief assessmenttask is to com-pute bel(Vi=vi)=P (Vi=vi|e) for one or more variablesVi.The most probable explanation(MPE) task is to find an as-signmentAmax that best explains the observed evidencee,i.e., P (Amax)=maxA Πn

i=1P (Vi=vAi | Par(Vi)A) [17]. Itis known that these tasks are NP-hard in general belief net-works [13]. A belief updating algorithm, polynomial with re-spect to|V |, is available forpolytrees, i.e., directed graphswithout undirected cycles [60]. However, in unconstrainedpolytrees, the propagation algorithm still has an exponentialbound with respect to the number of a node’s neighbors.

Since exact inference in belief networks is NP-hard, variousapproximation techniques have been investigated [18, 60, 64].

To the best of our knowledge, no approximation has been pro-posed that works well for all types of networks. Moreover,some approximation schemes have been proven to be NP-hard [15].

In this paper, we focus on a class of belief networks repre-senting a simplified model of conditional probabilities callednoisy-OR gates[60] (or QMR networks [14]). The simplifiedmodel contains binary-valued random variables. The noisy-OR model associates an inhibitory factor with every cause ofa single effect. The effect is absent only if all inhibitors cor-responding to the present causes are activated. The model as-sumes that all inhibitory mechanisms are independent [31, 60].This assumption of independence is ubiquitous in probabilisticfault localization approaches reported in the literature [41, 42].It indicates that all alternative causes of the same effect areindependent, which is consistent with the refinements we dis-cussed in Section 2. Recall that these refinements help avoidexponential time and memory otherwise needed to process andstore conditional probability matrices associated with randomvariables in the belief network. Furthermore, belief assessmentin polytrees with the noisy-OR model has polynomial com-plexity, which makes it attractive as an approximation schemato solve the fault localization problem.

4 MAPPING LAYERED MODEL INTO BELIEF NETWORK

The general layered dependency graph introduced in Section 2may be converted into a belief network simply by reversing itsedges. Every dependency graph node,X, becomes a belief net-work node,Vi. The domain of nodeVi, Di=SX , and the cor-responding probability matrixPi=PX . Recall from Section 2that the creation of such a general dependency graph is notfeasible in real-life systems because of the exponential sizesof the node state spaces and probability matrices. We definedand explained three properties that allowed this complexity tobe avoided. In the simplified layered dependency graph, prob-ability matrices are assigned to the dependency graph edgesrather than to its nodes. The probability matrixPYk,X assignedto the dependency linkX→Yk is a two-dimensional matrix ofsize |FX |×|FYk

|, such thatPYk,X(FX,j , FYk,i)=P{ serviceor functionX is in failure modeFX,j | service or functionYk

is in failure modeFYk,i}, whereFX,j∈FX and FYk,i∈FYk.

The belief network corresponding to the simplified layered de-pendency graph is built as follows.

• For every node of the layered dependency graph and forevery failure mode associated with this node, we create arandom variable, whose domain is{true, false}. Let Vi

be a belief network node created for failure modeFX,j

of the dependency graph node representingServiceL(a,b)or Network FunctionL(a). AssignmentVi =true indicatesthat ServiceL(a,b) or Network FunctionL(a) is in conditionFX,j . AssignmentVi =falsethatServiceL(a,b) or NetworkFunctionL(a) is NOT in conditionFX,j .

• For every dependency graph edgeX→Yk and for every fail-ure mode of nodeYk, FYk,i, determineFX,j , the failuremode of nodeX that results from conditionFYk,i in nodeYk. Let Vi be the belief network node corresponding to de-pendency graph nodeYk and failure modeFYk,i. Let Vj bethe belief network node corresponding to dependency graph

6

nodeX and failure modeFX,j . Add a belief network edgepointing fromVi to Vj .

• Let PYk,X be the probability matrix associated with depen-dency link X→Yk. The probability matrixPj associatedwith nodeVj represents the following conditional probabil-ity distribution.P (Vj=false| Vi=false)= 1P (Vj=false| Vi=true) = 1−PYk,X(Fi, Fj)P (Vj=true | Vi=false)= 0P (Vj=true | Vi=true) = PYk,X(Fi, Fj)

The belief network resulting from the mapping of the depen-dency graph presented in Figure 1 consists of one or more,possibly overlapping, belief networks of the shape presentedin Figure 2.

FNetworkFunctionsL+1(a), i

FServiceL+1(a,b), l

FNetworkFunctionsL(a), j

FNetworkFunctionsL−1(a), k

FNetworkFunctionsL+1(b), i

FNetworkFunctionsL(b), j

FNetworkFunctionsL−1(b), k

FServiceL(a,b), m

FServiceL−1(a,b), n

Figure 2: Belief network built for the dependency graph in Fig-ure 1.

To complete the mapping of the fault localization problem intothe problem of computing queries in belief networks, we needto define the interpretation of faults and symptoms in the do-main of belief networks. A symptom is defined as an observa-tion that a dependency graph nodeX, which typically corre-sponds to a higher-level service, is in conditionFX,j (negativesymptom), or is NOT in conditionFX,j (positivesymptom).We will denote byS the set of all possible symptoms. IfVi

is the belief network node corresponding to the dependencygraph nodeX and its failure modeFX,i, then the negativesymptom is interpreted as an instantiation ofVi with valuetrue,and the positive symptom is interpreted as an instantiation ofVi with valuefalse. The set of all observed negative symptomsand the set of all observed positive symptoms will be denotedby SN andSP , respectively. Thus, as a result of this map-ping, the set of all observed symptoms, which will be denotedby SO=SN∪SP⊆S, becomes the evidence sete. Note thatin general,SN∪SP 6=S, as some symptoms may be unobserv-able. If a symptom is unobservable, e.g., as result of the currentmanagement system configuration, the lack of negative obser-vation may not be interpreted as a positive observation. Theratio |SO|/|S| will be calledobservability ratio(OR). The be-lief network in whichSO represents the set of all observablesymptoms will be denoted byBN (SO).

The dependency graph nodeX, which corresponds to a lower-level service or function, is at fault if it is in any of the condi-tions inFX , say conditionFX,i. The set of all possible faults

is denoted byF . The fact that the service or function corre-sponding toX is in failure modeFX,i is represented by valuetrue in the domain of the random variableVi. The problem offinding the set of faults,Fc ⊆ F that best explains the set ofobserved symptomsSO may be solved by computing the MPEquery in belief networkBN (SO) based on the evidence sete.

5 FAULT LOCALIZATION TECHNIQUES

In this section, we address the problem of finding the set ofroot problems that best explains the set of observed symptomsusing a belief network as a fault propagation model. In general,the problem is known to be NP-hard [13]—the exact calcula-tion of the best explanatory hypothesis requires a number ofsteps that is exponential with respect to the number of graphnodes. One of the most popular exact algorithms–bucket elim-ination [17] is presented in Section 5.1. We use this algorithmas a reference algorithm against which algorithms introducedin this paper are compared. In Sections 5.2 and 5.3, we presenttwo algorithms derived from Pearl’s iterative propagation inpolytrees [60], which we adapt to solving the fault localiza-tion problem in arbitrary belief networks. These algorithmswere first introduced in [74]. The first algorithm, which ispresented in Section 5.2, utilizes iterative belief updating tocalculate the marginal posterior probability given the observedevidence. From this probability distribution we derive an ap-proximation of the most probable explanation of the evidence.The second algorithm (Section 5.3) applies iterative calcula-tion of the MPE query for polytrees [60] to arbitrary networks.Since the original algorithm [60] has exponential complexitywith respect to the maximum node degree, we introduce anapproximation that allows the calculations to be performed inpolynomial time.

5.1 Bucket elimination algorithm

Bucket eliminationproposed in [17] (Algorithm 1 , see Ap-pendix A) is one of the most popular algorithmic frame-works for computing queries in belief networks. Appendix Adiscusses how this algorithm computes the most-probable-explanation query.

The bucket eliminationalgorithm for computing MPE is ex-act and always outputs a solution. We consider it the optimalalgorithm for computing the explanation of the observed symp-toms. The computational complexity of the algorithm is boundbyO(|V |exp(w∗(o))), wherew∗(o) is the width of the graphinduced by orderingo, defined in Section 3.

5.2 Iterative belief updating

Recall from Section 3 that in singly-connected networks (poly-trees) representing the noisy-OR-gate model of conditionalprobability distribution, Bayesian inference (belief updating)may be computed in polynomial time using the algorithm pre-sented in [60]. Clearly, the belief network in Figure 2 is not apolytree because it contains undirected loops.

Networks with loops violate certain independence assumptionsbased on which the local computation equations were derivedfor polytrees. As suggested in [60], the iterative algorithmapplied to loopy networks may or may not converge. Never-

7

theless, successful applications of the iterative algorithm havebeen reported. The most famous of them are Turbo-Codes [4]that offer near Shannon limit correcting coding and decoding.The Turbo-Codes decoding algorithm was shown to be an in-stance of iterative belief propagation in polytrees applied toloopy networks [50]. Other compound codes were also for-mulated as a problem of belief propagation in graphical mod-els [43]. Previous applications of a deterministic decodingschema to deterministic fault localization [80] inspire the ap-plication of probabilistic decoding to fault localization in non-deterministic fault models.

The effectiveness of iterative propagation in loopy networksand its near-optimal accuracy came as a surprise to the researchcommunity [65]. While there is no theoretical explanation tothese results, some empirical research has been performed todetermine which properties of graphs make it more likely toachieve high accuracy while applying iterative propagation,and when no convergence can be achieved. In [52] the per-formance of iterative propagation in various types of graphsis investigated. It is concluded that, while iterative propaga-tion may offer close-to-optimal accuracy for many types ofnetworks, there are properties of conditional probability dis-tributions that make their corresponding Bayesian networksprone to oscillations when the iterative algorithm is applied.Low prior probabilities (i.e., values in probability matrices ofparent-less nodes) and small conditional probabilities associ-ated with the causal links seem to be contributing factors af-fecting the lack of the iterative algorithm’s convergence [52].

The impact of the low prior probabilities on the applicabilityof iterative propagation to loopy networks is discouraging be-cause prior probabilities in the fault localization task, whichcorrespond to independent fault occurrence probabilities, arevery small. In spite of that, our research investigates the itera-tive propagation technique in bipartite fault graphs. The reasonfor this is that we are not interested in the precise values of thebelief metric; as long as the relative values are preserved wecan still hope to achieve a good solution. In Section 7 we showthe encouraging results of this investigation.

Recall from Section 4 that the problem of fault localizationmay be translated into the most probable explanation (MPE)query in belief networks. The iterative algorithms for polytreesproposed in [60] include the algorithm for calculating MPE.Nevertheless, we start presenting iterative algorithms from thedescription of belief updating, which is conceptually simpler.We also present an adaptation of belief updating to estimatingthe MPE.

5.2.1 Iterative belief propagation concepts

Iterative belief propagation utilizes a message passing schemain which the belief network nodes exchangeλ andπ messages(Figure 3). MessageλX(vj) that nodeX sends to its parentVj for every validVj ’s valuevj , denotes a posterior probabil-ity of the entire body of evidence in the sub-graph obtained byremoving linkVj→X that containsX, given thatVj=vj . Mes-sageπUi

(x) that nodeX sends to its childUi for every validvalue ofX, x, denotes a probability thatX=x given the en-tire body of evidence in the subgraph containingX created by

removing edgeX→Ui. In this section, we present a summaryof the iterative algorithm for polytrees. The complete descrip-tion of the iterative algorithm for polytrees along with someillustrative examples may be found in [60].

X

UnUiU1

V1 Vj Vm

λU1(x)

λUi(x)

λUn(x)

λX(v1) λX(vj) λX(vm)

πX(v1)

πX(vj)

πX(vm)

πU1(x) πUi

(x) πUn(x)

λ(x) ?, π(x) ?bel(x) ?

Figure 3: Message passing in Pearl’s belief propagation

Based on the messages received from its parents and children,nodeX computesλ(x), π(x), andbel(x) as follows [60]:

λ(x) =n∏

i=1

λUi(x) (1)

π(x) ={

α∏m

j=1(1− cVjXπjX) if x = 0α(1−

∏mj=1(1− cVjXπjX)) if x = 1 (2)

bel(x) = αλ(x)π(x) (3)

In the above equations,πjX=πX(vj) for vj=1, α is a normal-izing constant, andβ is any constant. In a noisy-OR polytree,let us denote byqXUi

the probability of activating the inhibitorcontrolling link X→Ui. Every random variable assumes val-ues from{0, 1}, where 1 denotes occurrence of the correspond-ing event and 0 means that the event did not occur. The prob-ability thatUi occurs givenX occurs iscXUi

=1−qXUi. The

messagesλX(vj) andπUi(x) are computed using the follow-

ing equations [60]:

λX(vj)=β(λ(1)− q

vj

VjX(λ(1)− λ(0))∏k 6=j

(1−cVkXπkX))

(4)

πUi(x)=α

∏k 6=i

λUk(x)π(x) (5)

In the initialization phase, for all observed nodesX, λ(x) isset to 1 ifx is the observed value ofX. For other values ofx, λ(x) is set to 0. For all unobserved nodesλ(x) is set to 1for all values ofx. Parentless nodes have theirπ(x) set to theprior probabilities. The belief propagation algorithm in poly-trees starts from the evidence node and propagates the changedbelief along the graph edges by computingbel(x), λX(vi)sandπX(ui)s in every visited node. In loopy graphs, severaliterations are performed in which the entire graph is searchedaccording to some pre-defined ordering.

5.2.2 Application of belief propagation to fault localization

In the domain of fault localization, observations of networkdisorder (symptoms) are considered belief network evidence.Symptom observation is represented by assigning the corre-

8

sponding belief network node to 1, which means that the symp-tom did occur. Belief network nodes whose correspondingsymptoms were not observed are left unassigned (i.e., theirλ(0)=λ(1)=1). The fault localization task is to find an assign-mentFAmax . For fi∈F , assignmentfi=1 indicates that faultfi occurred, andfi=0 indicates that faultfi did not occur.

The fault localization algorithm utilizing iterative belief propa-gation starts with a belief network all of whose evidence nodescorresponding to observable symptoms are assigned to 0, andall other nodes are unassigned. Then the algorithm proceedsiteratively, after every symptom observation applying one iter-ation of belief updating traversing the graph according to someorder. For every symptom we define a different ordering thatis equivalent to the breadth-first order started from the noderepresenting the observed symptom. The initialization phaseconsiders all observable symptoms positive and calculates faultprobability distribution in the presence of no negative observa-tions. When a negative symptom is observed the propagationof negative evidence reverses the results of the correspond-ing positive symptom analysis performed in the initializationphase.

Clearly, results of negative symptom analysis may also be re-versed when the symptom is canceled or corresponding pos-itive symptom is observed. Moreover, the set of observablealarms may be easily modified during the fault localizationprocess by redefining the values of functionλ of nodes whoseobservability status has changed.

It may be noticed that in the unobserved symptom nodes, be-cause theirλ(x)=1 for any value ofx, λX(vj)=1 regardlessof the values of other messages used in the expression. Sincesymptom nodes have no children, there is no need to computeπmessages. Because of these properties, calculations performedin an unobserved symptom node do not change any of the mes-sages published by the node. As a result, an unobserved symp-tom node acts as a barrier through which belief does not prop-agate. Therefore, the fault localization algorithm does not visitunobserved symptom nodes, nor does it explore subgraphs ac-cessible only through unobserved symptom nodes.

For a positive symptom nodeX, and for any parent node ofX,Vj , observe that:

λX(vj) = βqvj

VjX

∏k 6=j

(1−cVkXπkX)

Since β is any constant, assignmentβ=α=1/(λX(Vj=1)+λX(Vj=0)) leads to the followingequation.

λX(vj) ={

1/(1 + qVjX) if vj = 0qVjX/(1 + qVjX) if vj = 1

SinceλX(vj) does not depend on the received values of func-tionsπX(vk), nodeX does not propagate evidence among itsparents. As a result, the iterative belief updating need not con-tinue past a positive symptom node.

The iterative belief propagation described above allows us toobtain the marginal posterior distribution resulting from theobservation of the evidence (symptoms). We use this distri-

bution to estimate the set of faults that are the most probablecauses of the observed symptoms. For this purpose, we intro-duce the following heuristics: (1) Choose a fault node with thehighest posterior probability, (2) Place the corresponding faultin the MPE hypothesis, (3) Mark the node as observed withvalue 1, and (4) Perform one iteration of the belief propagationstarting from the chosen fault node. Steps (1)-(4) are repeatedas long as (1) the posterior distribution contains fault nodeswhose probability is greater than 0.5, and (2) unexplained neg-ative symptoms remain.

Observe that an inherent property of Algorithm 2 is the capa-bility to isolate multiple simultaneous faults even if their symp-toms overlap. Observe also that the algorithm does not neces-sarily provide an explanation to all observed symptoms. Whennone of the available alternative hypotheses is associated withsufficient belief (i.e., 0.5), the risk of giving an incorrect an-swer by proposing one of the hypotheses is high. We favor notexplaining some symptoms over proposing a hypothesis that islikely to contain non-existent faults. Formally, the fault local-ization algorithm is defined as follows.

Algorithm 2 (Fault localization through iterative belief updating)

Inference iteration starting from node Yi:let o be the breadth-first order starting fromYi

for all nodesX along orderingo doif X is not an unobserved or positive symptom node then

computeλX(vj) for all X ’s parents,Vj ,and for allvj ∈ {0, 1}

computeπUi(x) for all X ’s children,Ui,and for allx ∈ {0, 1}

Initialization phase:for every symptomSi ∈ SO do

markSi as observed to have value of 0run inference iteration starting fromSi

Symptom analysis phase:for every symptomSi ∈ SN do


computebel(vi) for every nodeVi, vi ∈ {0, 1}Fault selection phase:

while∃ fault nodeVj for whichbel(1)>0.5 andSN 6= ∅ dotakeVj with the greatestbel(1)markVj as observed to have value of 1remove all symptoms explained byVj fromSN

run inference iteration starting fromVj

computebel(vi) for every nodeVi, vi ∈ {0, 1}

Note that, contrary to other approaches to fault localiza-tion [41, 42, 25], which delay symptom analysis until all symp-toms are collected, Algorithm 2 does not require all symptomsto be observed before their analysis may be started. On thecontrary, it analyzes a symptom independently of other symp-tom observations. The knowledge resulting from analyzing asymptom is stored for the next iterations in the belief networknodes in the form ofλ andπ messages, allowing Algorithm 2to utilize time more efficiently. Moreover, for every fault thealgorithm continuously provides the probability of its occur-rence given the symptoms observed so far.

Local computations in nodes requireO(d) operations, whered is the maximum node degree. A single iteration visits every

9

node at most once; therefore its computational complexity isO(|V |(d))=O(|E|). During the entire computation, at most|SO|+|F| iterations are performed, where usually|F|�|SO|.Thus, the complexity of the entire algorithm isO(|SO||E|).

5.3 Iterative most probable explanation

In this section, we present the application of the iterative MPEalgorithm for polytrees [60] to networks with undirected loops.The iterative belief updating algorithm presented in Section 5.2computes the marginal posterior probability of the Bayesiannetwork variable values given the observed evidence. In Al-gorithm 2, we used this distribution to select the most prob-able explanation. The MPE algorithm in every iteration pro-duces the most probable value assignment to the belief networknodes. This allows us to eliminate thefault selectionphase inAlgorithm 2, which contributes to the complexity and is an ad-ditional source of the inaccuracy.

5.3.1 Iterative MPE concepts

Similarly to belief updating, the MPE computation algorithmproceeds from the evidence nodes by passing messagesλ∗ andπ∗ along the belief network edges. Messageλ∗X(vj) sent bynodeX to its parentVj represents the conditional probabilityof the most probable prognosis for the values of nodes locatedin the subgraph containingX resulting from the removal of thelink Vi→X, given the propositionVi = vj . Messageπ∗Ui

(x)sent by nodeX to its childUi represents the probability of themost probable values of the nodes located in the subgraph con-tainingX resulting from the removal of linkX → Ui, whichinclude the propositionX = x. The belief metricbel∗(x)stands for the probability of the most probable explanation ofevidencee that is consistent with the propositionX = x. Mes-sagesλ∗X(vj) andπ∗Ui

(x), and belief metricbel∗(x) are com-puted using the following equations [60]:

λ∗X(vj)= maxx,{vk 6=vj}

( ∏i

λ∗Ui(x)P (x|v1, . . . , vm)∏

k 6=j

π∗X(vk))

(6)

π∗Ui(x)=

∏k 6=i

λ∗Uk(x)

max{vk}

(P (x|v1, . . . , vm)

∏k

π∗X(vk))

(7)

bel∗(x)=β∏k

λ∗Uk(x)

max{vk}

(P (x|v1, . . . , vm)

∏k

π∗X(vk))

(8)

Using notation from Section 5.2, the maximization may be ex-pressed as follows:

max{vk}

(P (x|v1, . . . , vm)

∏k

π∗X(vk))

= (9)max{vk}

((∏

Vk|vk=1 qVkX)∏

k π∗X(vk))

if x = 0

max{vk}

((1−

∏Vk|vk=1 qVkX)

∏k π∗X(vk)

)if x = 1

5.3.2 Application of iterative MPE to fault localization in be-lief networks with loops

The calculation ofλ∗s, π∗s, andbel∗s is the primary diffi-culty in using iterative MPE in practical applications. Whilefor x=0 expressionmax{vk}(P (x|v1, . . . , vm)·

∏k π∗X(vk))

may be simplified to∏

k max(π∗X(vk=0), qVkXπ∗X(vk=1)),the exact computation of the maximization forx=1 requiresenumerating all possible combinations of value assignments tothe parents ofX, and choosing a combination that maximizesthe value of the expression. Clearly, listing all combinationsis computationally infeasible. In this paper, we propose an ap-proximation that allows us to compute the maximization ex-pression in polynomial time. The approximation is presentedin Appendix B.

The algorithm for computing MPE calculatesλ∗ andπ∗ forevery network node traversing the graph starting from the ob-served symptom in a breadth-first order. A single traversal isrepeated for every observed symptom. At the end,bel∗ val-ues are computed for all network nodes. The MPE contains allfault nodes withbel∗(x=1)>bel∗(x=0).

Algorithm 3 (Fault localization through iterative MPE)

Inference iteration starting from node Yi:let o be the breadth-first ordering starting fromYi

for all nodesX along orderingo docomputeλ∗X(vj) for all X ’s parents,Vj ,

and for allvj ∈ {0, 1}computeπ∗Ui

(x) for all X ’s children,Ui,and for allx ∈ {0, 1}

Initialization phase:for every symptomSi ∈ SO do


Symptom analysis phase:for every symptomSi ∈ SN do


computebel∗(vi) for every nodeVi, vi ∈ {0, 1}Fault selection phase:

choose all fault nodes withbel∗(X=1)>bel∗(X=0)

Local computations in nodes requireO(d2) operations, whered is the maximum node degree. There are|SO| iterationsvisiting every belief network node at most once. This leadsto the computational complexity of the entire algorithm ofO(|SO||V |d2).

6 DEALING WITH NOISY OBSERVATIONS

In real-life communication systems, an observation of networkstate is frequently disturbed by the presence of lost and/or spu-rious symptoms (usually referred to as observation noise).

In a management system, alarms may be lost as a result of us-ing an unreliable communication mechanism to transfer alarmsfrom their origin to the management node. For example, sincethe SNMP protocol [10] exploits an unreliable transport layerprotocol (UDP), SNMP traps [10] issued by an SNMP agentare not guaranteed to be delivered to the destination. Manage-ment agents on network devices monitor values of various per-formance metrics, and issue alerts when the monitored metrics

10

violate pre-set threshold values. Too liberal threshold valuesmay prevent an existing problem from being reported, therebycausing alarm loss. Given the popularity of unreliable manage-ment protocols such as SNMP and the difficulty of calculatingthe correct threshold values, the probability of an alarm lossis not negligible. When the fault localization algorithm relieson positive information to create the most likely fault hypoth-esis, alarm loss, if ignored by the algorithm, could lead to anincorrect solution.

Another frequent disturbance in an observation of networkstate is due to spurious alarms, which are caused by intermit-tent network faults or by overly restrictive threshold values.Intermittent faults are the ones that result in observable symp-toms, but no longer exist at the time of the symptoms’ corre-lation. As with all fault types, the symptom patterns triggeredby the intermittent faults and their probability of occurrencemay be identified through the analysis of historical data such asalarm log files. When such identification is possible, the inter-relation among spurious symptoms may be taken into account.In this case, the intermittent faults can be modeled similar toordinary faults. Therefore, no change of either the fault prop-agation model, or the fault localization algorithm is needed.Oftentimes, in particular for spurious alarms due to overly re-strictive detection mechanisms, the determination of spurioussymptoms interdependence may be impossible. The methodproposed in this section focuses on symptoms for which inter-dependence information may not be learned.

We address the problem of lost and spurious alarms byaugmenting the belief network model presented in Sec-tion 2 using the technique we introduced in [73]. LetVS={VS1 , . . . , VSm

}⊂V , where m=|S|, be the set ofall belief network nodes which correspond to symptoms{S1, . . . , Sm}=S. We introduce a set of belief network nodes{V ′

S1, . . . , V ′

Sm}, which represent unobservable failures. Then

for every nodeVSi∈VS and for everyVj∈V−VS such that

(Vj , VSi)∈E we (1) remove(Vj , VSi

) from E and (2) add(Vj , V

′Si

) to E. Then, we add directed edgesV ′Si→VSi ,

i=1 . . .m. With every directed edgeV ′Si→VSi we associate

the probability of causal relationship betweenV ′Si

and VSi

equal to1−ploss(Si), whereploss(Si) is the probability thatalarmSi is lost. The values ofploss(Si) may be obtained, forexample, by analyzing packet loss rate in the network used totransport symptomSi from its origin to the management sta-tion.

To model spurious symptoms, we introduce nodesV ∗Si

,i=1 . . .m. Then, we add directed edgesV ∗

Si→VSi

. Withevery V ∗

Siwe associate prior beliefpspurious(Si) that repre-

sents the cumulative probability of events (other than persistentfaults) that trigger alarmSi. The value ofpspurious(Si) may belearned by analyzing historical alarm log files. Every directededgeV ∗

Si→VSi

is labeled with1−ploss(Si).

The resultant belief network, which will be denoted byBN (SO, ploss, pspurious), is presented in Fig. 4. Observe that,when ploss(Si)=0, edgesV ′

Si→VSi

(for i=1 . . . Ns) are re-dundant, and nodesV ′

SiandVSi

may be considered identical(V ′

Si=VSi

). Also, whenpspurious(Sj)=0, nodesV ∗Sj

are redun-dant and may be reduced (V ∗

Sj=VSj

). Thus,BN (SO, 0, 0) is

VS1 VSi VSm... ...

VS1 VSiVSm

... ...* * *

VS1 VSi VSm... ...

BN(SO)

pspurious(S1)

1−ploss(S1)

pspurious(Si) pspurious(Sm)

1−ploss(Si) 1−ploss(Sm)

1−ploss(S1) 1−ploss(Si) 1−ploss(Sm)

’ ’ ’

Figure 4: A belief network modeling lost and spurious symp-toms (BN (SO, ploss, pspurious))

equivalent to the belief networkBN (SO) introduced in Sec-tion 4. Belief networkBN (SN , 0, 0) is equivalent toBN (SN )which is the fault propagation model that does not take ei-ther positive, lost, or spurious symptoms into account. Whenthe existence of either lost or spurious (but not both) alarmsmay be neglected, one should useBN (SO, 0, pspurious) orBN (SO, ploss, 0), respectively.

To perform fault localization that includes lost and spurioussymptoms in the analysis usingBN (SO, ploss, pspurious) as afault propagation model algorithms proposed in Section 5 maybe applied.

7 END-TO-END SERVICE FAILURE DIAGNOSIS–A

SIMULATION STUDY

In this section, we describe the application of fault localizationtechniques proposed in Sections 2- 6 to the diagnosis of end-to-end service failures. Network connectivity in a given protocollayer is frequently achieved through a sequence of intermediatenodes invisible to the layers above. A failure of an intermedi-ate node may cause availability or performance problems onone or more end-to-end paths established using the malfunc-tioning node. End-to-end service failure diagnosis deals withisolating causes of failures associated with end-to-end paths.Diagnosing end-to-end service failures, both availability andperformance related ones, is a crucial step toward multi-layerfault localization. In this paper, end-to-end service failure di-agnosis is used as a case study through which the performanceand accuracy of algorithms introduced in Section 5 are evalu-ated.

7.1 End-to-end service model

The layered fault propagation model presented in Figure 1 inSection 2 seems only to represent services provided by point-to-point network connections ignoring the fact that networkconnectivity betweena andb in layerL is likely to be achievedthrough a sequence of intermediate nodes such as bridges inthe data-link layer or routers in the network layer. To representthis fact in the system model, a micro-model ofServiceL(a,b)should be introduced.

11

When connectivity betweena and b in layer L is achievedthrough a sequence of intermediate nodes, we say that theend-to-end service offered by layerL between hostsa andbis implemented in terms of multiple host-to-host services of-fered by layerL between subsequent hops on the path of thelayer L packet from nodea to nodeb. Figure 5 presents thelayered fault propagation model including a micro-model ofServiceL(a,b)provided using one intermediate hostc.

ServiceL−1(a,c)

ServiceL(a,c)

ServiceL+1(a,b)

Layer L+1

Network FunctionsL+1(a)

Layer L

Network FunctionsL(a)

Layer L−1

Network FunctionsL−1(a)

Layer L

Network FunctionsL(c)

Layer L−1

Network FunctionsL−1(c)

ServiceL−1(c,b)

ServiceL(c,b)

Layer L+1

Network FunctionsL+1(b)

Layer L

Network FunctionsL(b)

Layer L−1

Network FunctionsL−1(b)

ServiceL(a,b)

Figure 5: Layered network dependency model including anend-to-end service micro-model.

The graph connecting an end-to-end service to its implement-ing host-to-host services constitutes the micro-model of theend-to-end service. Micro-models of different end-to-end ser-vices provided in layerL share host-to-host service nodes. Asa result, the overlapping micro-models constitute a bipartitegraph with end-to-end services at the tails of the edges andhost-to-host services at the heads of the edges.

In Figure 6, we present the bipartite dependency graph cre-ated by micro-models of end-to-end services provided in thedata-link layer of a simple network composed of four learn-ing bridges [61]. The current spanning tree is marked in boldlines. In the dependency graph, we distinguish betweenlinks,which provide bridge-to-bridge delivery service, andpaths,which provide packet delivery service from the first to the lastbridge on the packet route from the source node to the desti-nation node. The delivery service provided by paths is builtof delivery services provided by links. We find it reasonableto consider unidirectional communication between two hosts aservice, although in many circumstances this would not be nec-essary. However, it is sometimes possible for a communicationbetween two hosts to fail only in one direction, while in theopposite direction it remains intact. By distinguishing betweenopposite directions, it becomes possible to detect these situa-tions. End-to-end service diagnosis in such a network aims atcorrelating symptoms related to path failures in order to isolateone or more responsible link failures.

7.2 Application of belief networks to end-to-end service fail-ure diagnosis

As a result of end-to-end service dependency model conver-sion into a belief network according to the algorithm presentedin Section 4, a bipartite graph is created, which contains twotypes of nodes: nodes corresponding to path failures (symp-toms) and nodes corresponding to link failures (faults). Theapplication of algorithms introduced in Section 5 to diagnosing

end-to-end services using a bipartite model is straightforward.In this section, we analyze algorithms’ complexity in this faultlocalization problem.

1. Bucket eliminationRecall from Section 5.1 that the computational complex-ity of bucket elimination in arbitrary belief networks isO(|V |exp(w∗(o))), wherew∗(o) is the width of the graph in-duced by orderingo, defined in Section 3. Clearly, we need tofind an ordering that induces the minimum width. To do this,let us moralize the belief network corresponding to the depen-dency graph in Figure 6-(b) (in the belief network, all edges arereversed). Using the definition of moralization from Section 3,the belief network corresponding to the dependency graph inFigure 6-(b) is moralized by introducing undirected edges be-tween link–A→B and link–B→C, link–A→B and link–B→D,link–B→C and link–D→B, link–B→D and link–C→B, link–C→B and link–B→A, as well aslink–B→A and link–D→B.Then, the arrows are removed. The moralization leads to thecreation of cliques, e.g., a subgraph containing nodeslink–C→B, link–B→A, andpath–C→A. Each of the cliques con-tains onepathnode and multiplelink nodes. One can observethat in the process of calculating the width of the graph (i.e.,creating the modified ordered graph), traversing the graph ac-cording to the ordering in which alllink nodes are given pri-ority, i.e., they are processed last, would result in the creationof cliques identical to the cliques in the moralized graph. Thewidth of a network ordered in this manner is therefore equalto the maximum clique size in the moralized graph minus one.One can show that this ordering induces the minimum width.The maximum clique size in the moralized graph is bound bythe maximum path length in the original network graph (i.e.,spanning tree in Figure 6-(a)). Thus, for ann-bridge/routernetwork the maximum clique of the moralized inverted depen-dency graph containsn nodes. This lets us conclude that theinduced width of bipartite belief networks modeling end-to-end service failures has an upper bound ofn−1. Therefore,the complexity of bucket elimination in application to end-to-end service diagnosis isO(n2exp(n)).2. Iterative belief propagationIn Section 5.2, we determined the complexity of fault local-ization iterative belief propagation in arbitrary belief networksto beO(|SO||E|). In a bipartite belief network representingend-to-end service model of ann-node network, there are atmostn2 paths (i.e., possible symptoms) and every path may becomposed of at mostn links. Thus, the graph contains at mostn3 edges leading to the complexity of the entire algorithm ofO(|SO|n3)⊆O(n5).3. Iterative MPEIn Section 5.3, we wrote that local computations in nodes re-quireO(d2) operations resulting in the complexity of the en-tire algorithm ofO(|SO||V |d2). Indeed, the calculation ofλmessages, which is performed only by path nodes, requiresO(d2)⊆O(n2) steps. However, messagesπ, which are cal-culated only by fault nodes, may be obtained inO(d)⊆O(n2)steps. Thus the calculation ofλ or π messages in a single iter-ation requiresO(n4) orO(n3) steps, respectively. As a result,the complexity of the entire algorithm in the application to end-to-end service diagnosis isO(|SO|n4)⊆O(n6).

12

link � A→B link � B→Alink � B→C link � C→Blink � B→D link � D→B

path � A→B path � B→Apath � B→C path � C→Bpath � B→D path � D→B

path � A→C

path � A→D

path � C→A

path � D→A

path � D→C

path � C→D

(a) (b)

Bridge A Bridge B

Bridge CBridge D

H2

H1

Figure 6: (a) Example bridge topology with the current spanning tree marked in bold; (b) Dependency graph built based on thespanning tree in (a).

7.3 Simulation model

The simulation study presented in this paper uses tree-shapednetwork topologies, which result, for example, from the usageof the Spanning Tree Protocol [61]. The usage of tree-shapedtopologies greatly simplifies their random generation, whilenot having any significant impact on the accuracy of the resultspresented in this section. We focus on diagnosing performanceproblems, e.g., excessive delay or high loss rate, which neces-sitates the usage of a non-deterministic fault model and faultlocalization algorithm.

We evaluated the algorithms presented in Section 5 using thefollowing simulation model. LetOR represent the alarm ob-servability ratio, i.e.,OR = |SO|/|S|. Let LR represent alarmloss rate, i.e., the ratio of the number of generated alarms thatwere lost to the number of all generated alarms. LetSSR rep-resent spurious alarm rate, i.e., the probability that an alarm isgenerated spontaneously. The three ratiosOR, LR, andSSRare parameters of the simulation model.

Given the simulation model with parametersOR, LR, andSSR, for a given network topology sizen, wheren representsthe number of intermediate network nodes, such as bridges orswitches, we designKn simulation cases as follows:

• We create a random tree-shapedn-node networkNi

(1≤i≤Kn). We denote the set of faults, the set of unob-servable end-to-end failures, and the set of symptoms fornetworkNi asFi, Ei, andSi, respectively. For every linkin Ni, we create onefj∈Fi. Note that in ann-node tree-shaped network there aren−1 links, i.e., |Fi|=n−1. Forevery end-to-end path inNi, we create oneel∈Ei and onesl∈Si. (Recall that|Ei|=|Si|∈O(n2).)

• We randomly generate prior fault probability distributionpf :Fi→[0.001, 0.01]; pf (fj)s are uniformly distributed overthe range[0.001, 0.01]. We randomly generate conditionalprobability distributionpfe: (Fi×Ei)→[0, 1), defined as theprobability thatel occurs givenfj occurs. For allfj∈Fi andel∈Ei such that the path corresponding toel includes the linkcorresponding tofj , pfe(fj , el)s are uniformly distributedover the range(0, 1); otherwise,pfe(fj , el)=0.

• We randomly generate the set of observable alarmsSiO⊆Si

such that on average|SiO||Si| =OR.

• Given networkNi, we build its belief network modelBN i(SiO, ploss, pspurious). We choose ploss=LR

(pspurious=SSR) to perform fault localization that in-cludes alarm loss (spurious alarms) in the analysis. Weuseploss=0 (pspurious=0) to perform fault localization thatdisregards lost (spurious) symptoms.

For i-th simulation case (1≤i ≤Kn), we createMS simulationscenarios as follows.

1 Using pf we randomly generate the set of faultylinks in network Ni, Fk

iC (1≤k≤MS), and createprobability distribution pk

e : Ei→[0, 1], where pke(ej)=

P{ej occurs|all faults inFkiC occur}.

2 Usingpke we randomly generate the set of eventsEk

iC result-ing from faults inFk

iC , and buildSkiC⊆Si, the set of alarms

corresponding to events inEkiC . We setSk

iG=SkiC∩SiO.

3 Using SSR we randomly generate the set of spurious alarmsSk

iS⊆SiO. We setSkiG+S=Sk

iG∪SiS .4 We createSk

iG−L–the set of generated alarms that are notlost by the communication system–by randomly removing

alarms fromSkiG+S so that on average

|SkiG−L|

|SkiG+S

|= 1−LR.

5 We setSkiN=Sk

iG−L; SkiN constitutes the set of negative

symptoms observed in the simulation scenariok.6 Using one of the fault localization algorithms we computeFk

iD⊆Fi, the most likely explanation of symptoms inSkiN .

We calculatedetection rate(DRki ) and false positive rate

(FPRki ) defined using the following equations.

DRki =

|FkiD ∩ Fk

iC ||Fk

iC |FPRk

i =|Fk

iD \ FkiC |

|FkiD|

For i-th simulation case we calculate the mean detec-tion rate DRi= 1

MS

∑MS

k=1 DRki and mean false positive rate

FPRi= 1MS

∑MS

k=1 FPRki . Then, we calculate the expected val-

ues of detection rate and false positive rate denoted by DR(n),and FPR(n), respectively. In our study, we usedKn=100, andMS=100 or 200 depending on the variability of DRki . We var-iedn from 5 to 50.

In the following sections, we apply this simulation model toevaluate the performance and accuracy of fault localization al-gorithms described in Section 5. We used JavaBayes [1] im-plementation of bucket elimination (Algorithm 1). We imple-mented Algorithms 2 and 3 in Java.

13

0.92

0.93

0.94

0.95

0.96

0.97

0.98

0.99

1

5 10 15 20 25 30 35 40 45 50

Det

ectio

n ra

te

Graph size

Algorithm 1Algorithm 2Algorithm 3

Figure 7: Comparison of accuracies achievable with Algo-rithms 1-3 for different network sizes.

7.4 Comparison of algorithms

The first simulation study was conducted to compare the per-formance and accuracy of the fault localization algorithms de-scribed in Section 5. We intentionally ignore positive, lost, andspurious symptoms. Formally, we setOR=|SN |/|S|, LR=0,andSSR=0 and use belief networkBN (SN ) as a fault prop-agation model. The link failure probabilities are uniformlydistributed random values of the order of10−6, and the con-ditional probabilities on causal links are uniformly distributedrandom values in the range[0.5, 1). Because of excessive sim-ulation time we had to limit the tested network size range forAlgorithms 1 and 3 to 10 and 25, respectively.

Figure 7 presents the relationship between detection rate andnetwork size. The detection rate is shown within statisticallycomputed confidence intervals. We observe that Algorithms 1and 3 outperform Algorithm 2 by 1-3%. The shape of thegraphs in Figure 7 indicates a strong dependency of the de-tection rate on the network size. This relationship may be ex-plained as follows. For small (5-node) networks, the numberof symptoms observed is typically small (less than 10), whichin some cases is not sufficient to precisely pinpoint the actualfault. Since in small networks the size ofFc is also small, anymistake in fault detection significantly reduces the detectionrate. When the network gets larger, the number of observedsymptoms increases, thereby increasing the ability to preciselydetect the faults. On the other hand, as the network size grows,the multi-fault scenarios become more and more frequent. In amulti-fault experiment, it is rather difficult to detect all actualfaults, which leads to partially correct solutions and, in conse-quence, the lower accuracy of Algorithm 2.

As depicted in Figure 8, the false positive rate of Algorithm 1is 1-2% lower than that of Algorithm 2. For small networks,Algorithm 3 has false positive rate comparable to Algorithm 1.However, as networks get bigger, the false positive rate of Al-gorithm 3 grows sharply suggesting that Algorithm 3 has a ten-dency to propose too big a set of faults as a final hypothesisthan is actually needed to explain all symptoms.

Figures 9 and 10 present the dependency of the correlation timeon the network size in the presence of 1, and 4 network faults,

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

5 10 15 20 25 30 35 40 45 50

Fal

se p

ositi

ve r

atio

Graph size


Figure 8: Comparison of false-positive metric values for Algo-rithms 1-3 using different network sizes.

0

1

2

3

4

5

6

5 10 15 20 25 30 35 40 45 50

Cor

rela

tion

time

[sec

]

Graph size


Figure 9: Comparison of single fault detection time for Algo-rithms 1-3 vs. network size

respectively. Although in the tested network size range, Al-gorithm 1 exhibited the best accuracy, the difference betweenthe accuracy of Algorithm 1 and that of other algorithms is toosmall to justify the substantially worsened performance. Al-gorithm 2 proved to be the most efficient one while preservingvery good accuracy. The correlation time of the order of sev-eral seconds, even for large networks and multi-fault scenarios,is encouraging.

Table 1 summarizes the results of the described simulationstudy. To make the comparison of the algorithms complete,we also evaluate them according to the criteria introduced inSection 1, i.e., (1) ability to diagnose multi-fault scenarios, (2)potential for dealing with lost and spurious symptoms, (3) abil-ity to work in the event-driven fashion, (4) incremental build-ing of the hypothesis, and (5) usability for prediction and testplanning. Clearly, all three algorithms are able to detect multi-ple simultaneous faults and are robust against the informationnoise given they are provided with an appropriate fault prop-agation model. Algorithms 2 and 3 are iterative; Algorithm 1is not, because running the inference after every symptom ob-servation would repeat calculations performed in the previousiterations. In addition, Algorithm 3 is incremental, while Algo-rithm 2 would require a complexity-increasing extension (ex-ecuting fault selection phase after every iteration) to possess

14

Table 1: Comparison of Algorithms 1- 3Algorithm Bucket Elimination Iterative Belief Updating Iterative MPE

(Alg. 1) (Alg. 2) (Alg. 3)Theoretical bound n2exp(n) n5 n6

Detection rate 96-100% 93-98% 95-100%False positive rate 0-4% 2-5% 0-8%

Max. network size withlocalization time<10s

10 50 25

Multi-fault scenarios yes yes yesLost and spurious symptoms yes yes yes

Is algorithm iterative? no yes yesIs algorithm incremental? no no yes

Prediction capabilities yes/no answer confidence assessment yes/no answer

0

10

20

30

40

50

60

5 10 15 20 25 30 35 40 45 50

Cor

rela

tion

time

[sec

]

Graph size


Figure 10: Comparison of correlation time vs. network size inthe presence of four network faults

this feature. In addition to fault localization, Algorithms 1and 3 make a binary prediction of network service failures;Algorithm 2 offers an estimation of confidence that a service isaffected.

7.5 Impact of positive symptoms

In this section, we evaluate the impact of including positivesymptoms into the fault localization process performed withthe most promising of the algorithms described in Section 5,Algorithm 2. We utilize the simulation model introduced inSection 7.3 settingLR=0, andSSR=0. Correspondingly, weuseBN (SO) andBN (SN ) as the fault propagation modelsto perform fault localization which includes and does not in-clude positive symptoms, respectively. The link failure prob-abilities are uniformly distributed over the range[0.001, 0.01],and the conditional probabilities on causal links are uniformlydistributed random values in the range[0, 1).

We compare fault localization with these two models using ob-servability ratiosOR equal to 1.0, 0.5, and 0.2. As shown inFigures 11 and 12, the inclusion of positive symptoms in thefault localization process allows the detection and false posi-tive rates to be substantially improved. The improvement isbigger for lower observability ratios; with high observabilityratios (e.g.,OR=1), the number of negative symptoms is typ-ically large enough to allow quite accurate fault localizationwithout considering positive symptoms. However, such highsymptom observability is unlikely in real-life systems. One

may also observe that the accuracy of both algorithms dependson the observability ratio; higherOR means more informa-tion about faults, and, in consequence, higher accuracy. On theother hand, including positive symptoms increases fault local-ization run-time, while increasingOR requires installing addi-tional probes, which increases fault localization overhead. Thesimulation study described in this section allows us to analyzethe trade-off between fault localization accuracy and perfor-mance.

7.6 Impact of lost and spurious symptoms

To isolate the impact of symptom loss on the accuracy of faultlocalization process, we setSSR=0 in the simulation model.Loss rate is either0.05 or 0.1. We compare the accuracyof Algorithm 2 using belief networksBN (SO, LR, 0) (takingsymptom loss into account) andBN (SO, 0, 0) (disregardingthe possibility of symptom loss), varyingOR between 0.2 and0.5. Figures 13 and 14 show that, by including loss rate in theanalysis, the detection (false positive) rate may be increased(decreased) by up to 10%. Moreover, given constantOR, thisgain is insensitive to the value ofLR.

The impact of including spurious symptoms in the fault local-ization process is evaluated by applying Algorithm 2 to faultpropagation modelsBN (SO, 0, SSR) andBN (SO, 0, 0) us-ing LR=0, andOR=0.5. We varySSR between 0.01 and0.1. As shown in Figure 15, the inclusion of spurious symp-toms in the fault localization process in small networks de-creases detection rate. This is explained by the fact that insmall networks (in particular, with small observability ratios),only a few symptoms are available to the fault localization pro-cess. When the possibility of spurious symptoms is taken intoaccount, and the number of symptoms is very small, the al-gorithm concludes that there is no sufficient evidential supportfor the existence of faults, and considers most of these symp-toms spurious. WhenSSR=0.1, for small networks, the prob-ability that all observed symptoms are spurious is frequentlyhigher than the probability of fault occurrence. Therefore, thealgorithm refuses to identify a fault thereby achieving very lowdetection rate. One can conclude that small networks need tobe better instrumented (i.e., have higherOR) to allow fault lo-calization to benefit from the analysis of spurious symptoms.In larger networks, the inclusion of spurious symptoms doesnot cause a decrease in the detection rate; in fact, as shownin Figure 15, it allows the detection rate to be improved. Fig-ure 16 presents the impact of including spurious symptoms on

15

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

5 10 15 20 25 30 35 40

Det

ectio

n ra

te

Network size

BN(SO)BN(SN)

OR = 1.0, LR = 0, SSR = 0OR = 0.5, LR = 0, SSR = 0OR = 0.2, LR = 0, SSR = 0

Figure 11: Detection rate obtained with Algorithm 2 whiledisregarding positive symptoms and while including positivesymptoms for different observability ratios.

0

0.05

0.1

0.15

0.2

5 10 15 20 25 30 35 40

Fal

se p

ositi

ve r

ate

�

Network size

BN(SO)BN(SN)

OR = 1.0, LR = 0, SSR = 0OR = 0.5, LR = 0, SSR = 0OR = 0.2, LR = 0, SSR = 0

Figure 12: False positive rate obtained with Algorithm 2 whiledisregarding positive symptoms and while including positivesymptoms for different observability ratios.

0.7

0.75

0.8

0.85

0.9

0.95

1

5 10 15 20 25 30 35 40

Det

ectio

n ra

te

Network size

BN(SO, LR, 0)BN(SO, 0 , 0)

OR = 0.5, LR = 0.05, SSR = 0OR = 0.5, LR = 0.1, SSR = 0OR = 0.2, LR = 0.1, SSR = 0

Figure 13: Detection rate obtained with Algorithm 2 usingfault propagation modelsBN (SO, LR, 0) andBN (SO, 0, 0).

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

5 10 15 20 25 30 35 40

Fal

se p

ositi

ve r

ate

�

Network size

BN(SO, LR, 0)BN(SO, 0 , 0)

OR = 0.5, LR = 0.05, SSR = 0OR = 0.5, LR = 0.1, SSR = 0OR = 0.2, LR = 0.1, SSR = 0

Figure 14: False positive rate obtained with Algorithm 2 usingfault propagation modelsBN (SO, LR, 0) andBN (SO, 0, 0).

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

5 10 15 20 25 30 35 40

Det

ectio

n ra

te

Network size

BN(SO, 0, SSR)BN(SO, 0, 0)

OR = 0.5, LR = 0, SSR = 0.01OR = 0.5, LR = 0, SSR = 0.1

Figure 15: Detection rate obtained with Algorithm 2using fault propagation modelsBN (SO, 0, SSR) andBN (SO, 0, 0).

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

5 10 15 20 25 30 35 40

Fal

se p

ositi

ve r

ate

�

Network size

BN(SO, 0, SSR)BN(SO, 0, 0)

OR = 0.5, LR = 0, SSR = 0.01OR = 0.5, LR = 0, SSR = 0.1

Figure 16: False positive rate obtained with Algo-rithm 2 using fault propagation modelsBN (SO, 0, SSR) andBN (SO, 0, 0).

16

the false positive rate. It shows that regardless of the networksize, by taking spurious symptoms into account, the false pos-itive rate of the fault localization process may be significantlydecreased.

7.7 Impact of probability estimation errors

One of the drawbacks of applying non-deterministic reasoningto the fault localization problem is the necessity to determinethe probabilities assigned to nodes and edges in the fault prop-agation model. Researchers frequently state that these proba-bilities may be assigned by a human expert [41]. This processbeing error prone, it is likely that the probabilities assigned bythe expert will differ from those describing the real system. Inactuality, the expert assigns discrete confidence levels ratherthan the exact probabilities.

In this section, we analyze the impact of such probability es-timation on the accuracy of Algorithm 2. For this purpose,we designed simulation experiments as described in Section 7.However, in the belief network, exact probabilities are replacedby c confidence levels;i-th confidence level (i=1 . . . c) is rep-resented by probabilityi−1

c + 12c . Thus, probabilityp in the real

system is estimated as probabilitybpccc + 1

2c .

Figures 17 and 18 compare the detection rate of Algorithm 2having exact knowledge of the probability distribution with thedetection rate achieved using one, two, and three confidencelevels for various observability ratios. Similar comparison offalse positive rates is shown in Figures 19 and 20. It can beobserved that probability estimation with three confidence lev-els allows fault localization to be almost as accurate as withthe knowledge of exact probabilities. This observation has animportant implication: it allows the expert to use a small setof meaningful qualitative probability assignments such asun-likely, possible, andlikely, which correspond to confidence lev-els 1, 2, and 3, respectively.

8 OTHER CANONICAL MODELS

So far in this paper, we have focused on fault propagation mod-els represented by noisy-OR-gate belief networks, which allowfault localization to be performed in polynomial time using Al-gorithm 2. In noisy-OR-gate networks, a variable value is ob-tained by combining its predecessors’ values using logicalor.This is, clearly, the most useful and wide-spread representationin the area of fault management [41, 42].

Unfortunately, for some fault management problems, thenoisy-OR-gate model is inadequate. In this section, we ex-plore the application of other canonical belief network modelsto solving these problems.

8.1 AND model

Let us consider a popular high-availability scenario in whichtwo alternative physical network connections are provided be-tween two neighboring hosts. To model this situation usingthe graph proposed in Section 2, we create nodeX to rep-resent connectivity between the two hosts, and nodesY1 andY2 to represent physical connections, whereX depends onY1

andY2. When one of the physical connections fails, i.e.,Y1

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

5 10 15 20 25 30 35 40 45

Det

ectio

n ra

te

Network size

exactthree confidence levels

two confidence levelsone confidence level

Figure 17: Detection rate of Algorithm 2 with exact andapproximate knowledge of the probability distribution forOR=0.5.

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

5 10 15 20 25 30 35 40 45

Det

ectio

n ra

te

Network size



Figure 18: Detection rate of Algorithm 2 with exact andapproximate knowledge of the probability distribution forOR=0.2.

or Y2 is in failure modeF1, the entire traffic between the twohosts is transferred to the second, still operating connection.Thus, only if both physical connections fail, may the connec-tivity failure between the hosts be observed. Clearly,Y1 andY2 do not independently contribute to the failureF1 of X, andtherefore this high-availability scenario may not be representedusing the noisy-OR model.

A possible solution to the problem is to detect hard failuresof Y1 and Y2 using configuration-change monitoring toolsavailable in many management systems (see examples in Sec-tion 2.3), and accordingly modify the model so that noisy-OR property is preserved. However, in many situations, apreferable approach would rather embed the information aboutthe high-availability configuration into the fault propagationmodel, so that the monitoring of the configuration changes maybe avoided. The relationship between conditionF1 in X, andY1 andY2 should be modeled by combiningX ’s predecessors’values using logicaland. To perform belief updating using Al-gorithm 2, the Equations (2) and (4) in Section 5.2 used tocalculateλ andπ in an and-nodeX, should be modified as

17

0

0.05

0.1

0.15

0.2

0.25

0.3

5 10 15 20 25 30 35 40 45

Fal

se p

ositi

ve r

ate

�

Network size



Figure 19: False positive rate of Algorithm 2 with exactand approximate knowledge of the probability distribution forOR=0.5.

0

0.05

0.1

0.15

0.2

0.25

0.3

5 10 15 20 25 30 35 40 45

Fal

se p

ositi

ve r

ate

�

Network size



Figure 20: False positive rate of Algorithm 2 with exactand approximate knowledge of the probability distribution forOR=0.2.

follows.

π(x) ={

α∏m

j=1 cVjXπjX if x = 1α(1−

∏mj=1 cVjXπjX) if x = 0 (10)

λX(vj)= β(λ(0) + vjc

vj

VjX(λ(1)− λ(0))∏k 6=j

cVkXπkX

)(11)

8.2 NOT model

In the NOT model, a variable value is calculated as a logicalnegation of its single predecessor’s value. Intuitively, for avariable to be true, its predecessor must not be activated, orif its predecessor is activated, the inhibitor on the correspond-ing causal link is activated as well. The replacement for Equa-tions (10) and (11) in the NOT model are obtained in a straight-forward manner by settingm=1 and exchangingπ(x=1) withπ(x=0) andλX(vj=1) with λX(vj=0), respectively.

8.3 A hybrid model

In real-life scenarios, a hybrid model is useful, in which a nodemay apply different logical operators to different subsets of its

predecessors. Let us consider two cases when such a modelwould be used.

Recall from Section 2 that conditionF1 (hard failure) associ-ated with a service or functionX excludes the possibility ofthe occurrence of performance-related failure conditions, e.g.,F2. In the belief network, we would model this situation bycreating nodeVi, which represents the cause of conditionF1

(e.g., a broken cable on link a–b). Then, we would connectVi

to the node representing conditionF2 in X, Vj . The nodeVj

combinesVi with all its other predecessors,U1, . . . , Uk, usingformulanotVi and(U1 or . . . or Uk).

A similar problem arises when we attempt to model symp-tom loss in a management system that transports symptoms in-band. Then, the transmission of symptomSi from nodea to themanagement nodem relies on the quality of patha→m. Highprobability of packet loss on patha→m (conditionF2 associ-ated with patha→m) lessens the probability of the symptomSi

reception by the management station. To represent this fact, wewould add an edge from the node corresponding to conditionF2 associated with patha→m, Wj , to the node representingsymptomSi, VSi , in the extended belief network (Section 6).NodeVSi should combineWj with its other predecessor,V ′

Si,

using formulanotWj andV ′Si

.

In the extreme case, a hybrid model associated with variableVi could involve an arbitrary logical expression overVi’s pre-decessors. In this case, calculatingπ andλ would correspondto performing inference in a belief network identical with thesyntax-decomposition tree of the logical expression, which isrooted atVi. In practice, much simpler hybrid models areneeded, in whichπs andλs may be calculated using compactand easy to obtain formulas.

9 RELATED WORK

In the past, various event correlation techniques were proposedincluding rule-based systems [47, 79], model-based reasoningsystems [34, 56], model traversing techniques [36, 38], case-based systems [46], fault propagation models [29, 41], and thecode-book approach [80]. Most of these approaches are de-terministic, while this paper focuses on non-deterministic faultdiagnosis techniques. In this section, we compare the solutionsproposed in this paper against other approaches to modelingsystems for the purpose of fault diagnosis and performing faultlocalization in communication systems.

9.1 Fault propagation models

The design of the model capable of representing fault propa-gation patterns in complex communication systems has been asubject of several publications in the past [27, 33, 39, 41, 57].Researchers usually propose that communication systems bemodeled using a dependency graph. Some approaches dividephysical and abstract system components into a small set ofclasses, e.g.,node, path, andshared resource[33], or func-tion, service, and protocol [27]. SMARTS [80] allows theset of classes to be extended using the specialized definitionlanguage called MODEL [57]. In [39], relationships betweencomponents are obtained through two basic methods:getNex-tHop(), andgetLowerLayer()indicating horizontal and vertical

18

relationships between system components, respectively. Theset of relationships may be extendible as it is the case in [57].Clearly, the model proposed in this paper builds upon theseideas, and is particularly close to the layered system modelproposed in [27]. However, our model incorporates end-to-end service micro-models (not available in [27]), which makethe modeling technique suitable for communication systemswith complex topologies. The most significant difference be-tween our model and the models proposed in [27, 33, 39] isthat our model is non-deterministic. The approach proposedin [57] models non-determinism by labeling relationships be-tween components with probabilities; OR-model of condi-tional probability distribution is assumed. The model proposedin this paper allows an arbitrary conditional probability distri-bution.

A different form of the dependency graph has been proposedin [41]. The dependency graph consists of nodes that representindivisible objects within the communication system. With ev-ery node a probability is associated that represents the likeli-hood of the corresponding object’s independent failure. Theedges indicate dependencies between system components, andtheir associated weights indicate the probability that the nodeat the head of an edge fails if the node at the tail of the edgefails. The model presented in [41] is flat; It does not distinguishbetween protocol layers. We believe that the model proposedin this paper is more suitable for multi-layer fault diagnosisby exploring the layered architecture of typical communicationsystems. It is also easier to build, since it provides a hierarchi-cal template whose sub-components may be created separately,each with a desired degree of detail.

None of the models proposed in [27, 33, 39, 41] allows theperformance problems to be modeled. This paper addressesthis issue by introducing multiple failure modes.

The hierarchical modeling of uncertainty with belief networkspresented in this paper is consistent with other work on con-structing diagnostic models [71]. Contrary to the approachin [71], in our model, each component may have more thanone output, which allows representation of different types ofinfluences caused by one component on its dependent compo-nents.

9.2 Non-deterministic fault localization techniques

When the fault propagation model is provided, the fault isola-tion process explores the model to verify correlation betweenevents. Model-based reasoning systems [34, 56] utilize infer-ence engines controlled by a set of correlation rules, whichcontain model exploration predicates. Model-traversal tech-niques recursively search the dependency graph toward thefailing object in an event-driven fashion starting from the com-ponent that symptoms refer to [33, 36, 38]. To the best ofour knowledge, all model-based reasoning systems and model-traversal techniques reported in the literature are deterministic.

So far, the most comprehensive approach to non-deterministicfault localization has been proposed by Katzela et al. [41]. Thefault propagation algorithm in [41] uses a dependency graph asa fault model and exploits the maximum mutual dependencyheuristics to isolate multiple simultaneous faults. This heuris-

tics relies on the assumption that of two sets of faults which ex-plain all the observed symptoms, the better solution is the onethat exhibits the higher mutual dependency among its mem-bers. We believe that in communication systems in which mostfaults are independent of one another, this assumption doesnot hold. In addition to this problem, the approach presentedin [41] does not allow lost or spurious symptoms. The correla-tion is window-based and only one failure mode per object isallowed.

Kliger et al. [42] propose a probabilistic model to be used withthe codebook approach [80], which creates a symptom-faultdictionary (called a codebook) by reducing the fault propaga-tion model to a bipartite graph using serial and parallel reduc-tion operators. In a non-deterministic case, this reduction isan NP-hard problem on its own. In addition, the reduction toa bipartite graph loses some dependency information that maybe useful in fault localization. The deterministic codebook ap-proach is able to isolate only one network fault in a window-based manner. More importantly, no non-deterministic decod-ing schema is proposed in [42]. The algorithms proposed inthis paper (Algorithms 2 and 3) can be used for this purpose.These algorithms do not require graph reduction, and thereforetake advantage of all the available dependency information. Inaddition, multiple simultaneous faults may be detected in aniterative and, in the case of Algorithm 3, incremental manner.

The literature on event correlation contains reports of applyingbelief networks to fault diagnosis. However, the approachesare limited to rather specific fault diagnosis problems, whichuse simplified belief network models. In fault diagnosis in lin-ear light-wave networks [20] and in diagnosing connectivityproblems in communication systems [78] conditional probabil-ities are 0,1-valued. To trouble-shoot printing services a tree-shaped belief network is used [31]. A bipartite belief networkis used in [11] to pinpoint LAN segments suspected of havinga particular fault. The reasoning mechanism used in [11] isnot able to precisely identify the fault (e.g., the malfunctioningnode). Moreover, it is based on observations that are particularto the LAN environment and are not applicable in general. Thetechniques proposed in [11, 20, 31, 78] derive heuristics thatapproximate Bayesian reasoning in the particular fault local-ization problems they attempt to solve. In contrast, this paperadapts techniques of Bayesian reasoning in belief networks ofarbitrary shape to provide a single solution to a wide-range offault localization problems. In addition, the solutions proposedin [11, 20, 31, 78] do not allow lost or spurious symptoms, andsome of them are window-based [11, 20, 78].

Other approaches to dealing with uncertainty in network faultdiagnosis include an application of Dempster-Shafer theory todetect break faults in communications networks [16]. Simi-larly to [78], this technique is tailored specifically to diagnos-ing connectivity problems in networks with known and statictopologies. This solution would not be sufficient for the pur-pose of diagnosing performance problems, or when the knowl-edge of the network topology is uncertain or incomplete.

Statistical data analysis methods were used for non-deterministic fault diagnosis in bipartite-graphs in [25]. Thesolution is proposed to detect link failures in wireless and/or

19

battlefield networks based on the observed set of brokenend-to-end connections focusing on unknown and constantlychanging network topologies. This technique [25] is not ableto deal with lost or spurious symptoms and it is window-based.

10 CONCLUSIONS ANDFUTURE WORK

This paper investigates an application of Bayesian reason-ing using belief networks [60] to non-deterministic fault di-agnosis in complex communication systems. We introducea non-deterministic model template, which incorporates bothavailability and performance problems and reveals the multi-layered structure of communication systems. This templatemay be extended with micro-models of particular networkfunctions and services, to create a comprehensive systemmodel with a desired level of detail. As an example of a micro-model, we introduce a model of end-to-end services providedin a given protocol layer. We propose a mapping of the lay-ered dependency graph into a belief network and a mappingof the fault localization problem into the problem of comput-ing queries in belief networks. Then, we investigate an appli-cation of known algorithms for performing Bayesian reason-ing in belief networks to fault localization. We directly ap-ply the bucket elimination framework [17] and adapt two algo-rithms for calculating queries in singly-connected belief net-works (polytrees): (1) belief updating and (2) most-probableexplanation [60].

We conclude that, due to its exponential computational com-plexity, the exact Bayesian inference is not feasible in the ap-plication to fault localization. However, as revealed by thesimulation study on end-to-end service failure diagnosis, theapproximate techniques offer much better (polynomial) perfor-mance while retaining almost optimal accuracy. The approx-imate techniques proposed in this paper meet all or most ofthe objectives of our research: (1) They are iterative; (2) Theyare able to detect multiple simultaneous faults; (3) They aresuitable for a variety of fault localization problems; and (4)They facilitate other fault management tasks, e.g., determin-ing the system components affected by faults. The techniquesare able to perform reasoning with both positive and negativeinformation and, as shown through the case study involving Al-gorithm 2, they seamlessly deal with observation noise allow-ing the accuracy of fault localization in the presence of lost andspurious symptoms to be improved. In addition, Algorithm 2 isresilient to the inaccuracy of probability estimation in the faultpropagation model, which proves its feasibility in the real-lifeapplications. We therefore conclude that belief networks are apromising model for non-deterministic fault localization.

While Bayesian reasoning proves to be powerful in the appli-cation to fault localization, this paper does not suggest that itis the only tool to be used for this purpose. In fact, we be-lieve that fault localization would benefit from combining var-ious paradigms, including Bayesian reasoning, into a commonsolution. The design of such hybrid tools is an interesting fu-ture research problem. As an example of a fault localizationparadigm that could be combined with Bayesian reasoning, weconsider a model-based layered approach introduced in [2].

The performance of fault localization using belief networks

could be improved by applying hierarchical reasoning, inwhich the fault localization is performed first using a high-level model to pinpoint the location of the problem, and thena more detailed model would be used to identify the preciselocation of the fault within the previously pinpointed location.The theoretical foundation for such an approach has been laidin [71]. Since the high-level model of the entire system anda detailed model of the pinpointed location are both smallerin size than a single detailed system model, the hierarchicalreasoning reduces the complexity of the fault localization task.This paper implicitly advocates the usage of hierarchical rea-soning by presenting a case study on the analysis of end-to-endservice failures, which is an example of a high-level fault lo-calization problem. The services identified as faulty duringend-to-end service diagnosis need to be later analyzed on thedetailed level to precisely determine the root causes. The iden-tification of high- and low-level problems, the determinationof the optimal number of hierarchical reasoning levels, and thedesign of comprehensive hierarchical fault localization algo-rithms remain the subjects of future research.

In the area of end-to-end fault diagnosis, the future researchwill investigate distributed fault localization techniques inwhich multiple fault localization applications cooperate to iso-late the causes of end-to-end service failures. Such a dis-tributed solution should explore domain semantics of typicalcommunication networks. We will also address end-to-end ser-vice failure diagnosis in situations where the diagnostic modelis difficult to build. We will investigate performing reasoningwith incomplete and ambiguous dependency information.

REFERENCES

[1] http://www.cs.cmu.edu/ javabayes/Home/.[2] K. Appleby, G. Goldszmidt, and M. Steinder. Yemanja–a layered event

correlation system for multi-domain computing utilities.Journal of Net-work and Systems Management, 10(2):171–194, 2002.

[3] S. Bagchi, G.Kar, and J. Hellerstein. Dependency analysis in distributedsystems using fault injection: application to problem determination inan e-commerce environment. In Festor and Pras [26].

[4] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limiterror-correcting coding and decoding: Turbo Codes. InProc. Int’l Com-munications Conference, pp. 1064–1070, 1993.

[5] A. Bierman and K. Jones.Physical topology MIB. IETF Network Work-ing Group, 2000. RFC 2922.

[6] A. T. Bouloutas, S. Calo, and A. Finkel. Alarm correlation and faultidentification in communication networks.IEEE Transactions on Com-munications, 42(2/3/4):523–533, 1994.

[7] Y. Breitbart, M. Garofalakis, C. Martin, R. Rastogi, S. Seshadri, andA. Silberschatz. Topology discovery in heterogeneous IP networks. InProc. of IEEE INFOCOM, pp. 265–274, 2000.

[8] A. Brown, G. Kar, and A. Keller. An active approach to characteriz-ing dynamic dependencies for problem determination in a distributedapplication environment. In Pavlou et al. [59].

[9] J. Case, M. Fedor, M. Schoffstall, and J. Davin.A Simple NetworkManagement Protovol (SNMP). IETF Network Working Group, 1990.RFC 1157.

[10] J. D. Case, K. McCloghrie, M. T. Rose, and S. Waldbusser.ProtocolOperations for Version 2 of the Simple Network Management Protocol(SNMPv2). IETF Network Working Group, 1996. RFC 1905.

[11] C. S. Chao, D. L. Yang, and A. C. Liu. An automated fault diagnosissystem using hierarchical reasoning and alarm correlation.Journal ofNetwork and Systems Management, 9(2):183–202, 2001.

[12] R. Comerford. The new software paladins.IEEE Spectrum, 37(6), Jun.2000.

[13] G. F. Cooper. Probabilistic inference using belief networks is NP-Hard.Technical Report KSL-87-27, Stanford University, 1988.

[14] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter.Probabilistic Networks and Expert Systems. Statistics for Engineeringand Information Science. Springer-Verlag, 1999.

[15] P. Dagum and M. Luby. Approximately probabilistic reasoning inBayesian belief networks is NP-hard.Artificial Intelligence, pp. 141–153, 1993.

[16] N. Dawes, J. Aloft, and B. Pagurek. Network diagnosis by reasoning in

20

uncertain nested evidence spaces.IEEE Transactions on Communica-tions, 43(2/3/4):466–476, 1995.

[17] R. Dechter. Bucket Elimination: A unifying framework for probabilisticinference. In E. Horvitz and F. V. Jensen, eds,Uncertainty in ArtificialIntelligence, pp. 211–219. Morgan Kaufmann Publishers, 1996.

[18] R. Dechter and I. Rish. A scheme for approximating probabilistic infer-ence. In D. Geiger and P. Shenoy, eds,Uncertainty in Artificial Intelli-gence, pp. 132–141. Morgan Kaufmann Publishers, 1997.

[19] E. Decker, P. Langille, A. Rijsinghani, and K. McCloghrie.Definitionof Managed Objects for Bridges. IETF Network Working Group, 1993.RFC 1493.

[20] R. H. Deng, A. A. Lazar, and W. Wang. A probabilistic approach tofault diagnosis in linear lightwave networks. In Hegering and Yemini[32], pp. 697–708.

[21] DoD. Military Standard—Interoperability Standard for Digital Mes-sage Device Subsystems (MIL-STD 188-220B), Jan. 1998.

[22] R. Dssouli, K. Saleh, E. Aboulhamid, A. En-Nouaary, and C. Bourhfir.Test development for communication protocols: Towards automation.Comput. Networks, 31:1835–1872, 1999.

[23] C. Ensel. Automated generation of dependency models for service man-agement. InWorkshop of the OpenView University Association (OVUA),Bologna, Italy, Jun. 1999.

[24] S. Fakhouri, G. Goldszmidt, I. Gupta, M. Kalantar, and J. Pershing.GulfStream – a system for dynamic topology management in multi-domain server farms. InIEEE International Conference on ClusterComputing, 2001.

[25] M. Fecko and M. Steinder. Combinatorial designs in multiple faultslocalization for battlefield networks. InIEEE Military Commun. Conf.(MILCOM), McLean, VA, 2001.

[26] O. Festor and A. Pras, eds.Twelth Int’l Workshop on Distributed Sys-tems: Operations and Management, Nancy, France, Oct. 2001.

[27] R. Gopal. Layered model for supporting fault isolation and recovery. InNOMS’00 [53], pp. 729–742.

[28] R. Govindan and H. Tangmunarunkit. Heuristics for Internet map dis-covery. InProc. of IEEE INFOCOM, pp. 1371–1380, 2000.

[29] M. Hasan, B. Sugla, and R. Viswanathan. A conceptual framework fornetwork management event correlation and filtering systems. In Slomanet al. [70], pp. 233–246.

[30] P. Hasselmeyer. Managing dynamic service dependencies. In Festorand Pras [26].

[31] D. Heckerman and M. P. Wellman. Bayesian networks.Communica-tions of the ACM, 38(3):27–30, Mar. 1995.

[32] H. G. Hegering and Y. Yemini, eds.Integrated Network ManagementIII . North-Holland, Apr. 1993.

[33] K. Houck, S. Calo, and A. Finkel. Towards a practical alarm correlationsystem. In Sethi et al. [68], pp. 226–237.

[34] G. Jakobson and M. D. Weissman. Alarm correlation.IEEE Network,7(6):52–59, Nov. 1993.

[35] D. Johnson and D. Maltz.Dynamic Source Routing in Ad Hoc WirelessNetworks, chapter 5, pp. 153–181. Kluwer Academic Publishers, 1996.

[36] J. F. Jordaan and M. E. Paterok. Event correlation in heterogeneous net-works using the OSI management framework. In Hegering and Yemini[32], pp. 683–695.

[37] G. Kar, A. Keller, and S. Calo. Managing application services overservice provider networks. In NOMS’00 [53].

[38] S. Katker. A modeling framework for integrated distributed systemsfault management. In C. Popien, ed,Proc. IFIP/IEEE Int’l Conferenceon Distributed Platforms, pp. 187–198, Dresden, Germany, 1996.

[39] S. Katker and M. Paterok. Fault isolation and event correlation for inte-grated fault management. In Lazar et al. [44], pp. 583–596.

[40] I. Katzela. Fault Diagnosis in Telecommunications Networks. PhDthesis, School of Arts and Sciences, Columbia University, New York,http://www.comm.utoronto.ca/∼irene/thesis.html, 1996.

[41] I. Katzela and M. Schwartz. Schemes for fault identification in commu-nication networks.IEEE Transactions on Networking, 3(6):733–764,1995.

[42] S. Kliger, S. Yemini, Y. Yemini, D. Ohsie, and S. Stolfo. A codingapproach to event correlation. In Sethi et al. [68], pp. 266–277.

[43] F. R. Kschischang and B. J. Frey. Iterative decoding of compound codesby probability propagation in graphical models.Journal on SelectedAreas in Communications, 16(2):219–230, Feb. 1998.

[44] A. Lazar, R. Saracco, and R. Stadler, eds.Integrated Network Manage-ment V. Chapman and Hall, May 1997.

[45] D. Lee and M. Yannakakis. Principles and methods of testing finite statemachines—a survey.Proc. IEEE, 84(8):1090–1123, Aug. 1996.

[46] L. Lewis. A case-based reasoning approach to the resolution of faults incommunications networks. In Hegering and Yemini [32], pp. 671–681.

[47] G. Liu, A. K. Mok, and E. J. Yang. Composite events for network eventcorrelation. In Sloman et al. [70], pp. 247–260.

[48] B. Lowecamp, D. R. O’Hallaron, and T. R. Gross. Topology discoveryfor large Ethernet networks. InProc. of ACM SIGCOMM, pp. 239–248,2001.

[49] K. McCloghrie and M. Rose.Management Information Base for Net-work Mangement of TCP/IP-based internets: MIB-II. IETF NetworkWorking Group, 1991. RFC 1213.

[50] R. J. McEliece, D. J. C. MacKay, and J.-F. Cheng. Turbo decoding as aninstance of Pearl’s ”belief propagation” algorithm.Journal on SelectedAreas in Communications, 16(2):140–151, Feb. 1998.

[51] D. M. Meira and J. M. S.Nogueira. Modelling a telecommunicationnetwork for fault management applications. In NOMS’98 [54], pp. 723–732.

[52] F. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagationfor approximate inference: An empirical study. In K. B. Laskey andH. Prade, eds,Uncertainty in Artificial Intelligence. Morgan KaufmannPublishers, 1999.

[53] Proc. Network Operation and Management Symposium, Honolulu,Hawaii, 2000.

[54] Proc. Network Operation and Management Symposium, New Orleans,LA, 1998.

[55] M. Novaes. Beacon: A hierarchical network topology monitoring sys-tem based in IP multicast. In A. Ambler, S. B. Calo, and G. Kar,eds,Services Management in Intelligent Networks, no. 1960 in LectureNotes in Computer Science, pp. 169–180. Springer-Verlag, 2000.

[56] Y. A. Nygate. Event correlation using rule and object based techniques.In Sethi et al. [68], pp. 278–289.

[57] D. Ohsie, A. Mayer, S. Kliger, and S. Yemini. Event modeling with theMODEL language. In Lazar et al. [44], pp. 625–637.

[58] K. Ohta, T. Mori, N. Kato, H. Sone, G. MAnsfield, and Y. Nemoto.Divide and conquer technique for network fault management. In Lazaret al. [44], pp. 675–687.

[59] G. Pavlou, N. Anerousis, and A. Liotta, eds.Integrated Network Man-agement VII. IEEE, May 2001.

[60] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks ofPlausible Inference. Morgan Kaufmann Publishers, 1988.

[61] R. Perlman. Interconnections, Second Edition: Bridges, Routers,Switches, and Internetworking Protocols. Addison Wesley, 1999.

[62] S. Ramanathan, D. Caswell, and S. Neal. Auto-discovery capabilitiesfor service management: An ISP case study.Journal of Network andSystems Management, 8(4):457–482, 2000.

[63] A. Reddy, D. Estrin, and R. Govindan. Large-scale fault isolation.jsac,18(5):723–732, 2000.

[64] I. Rish and R. Dechter. On the impact of causal independence. Tech-nical report, Dept. of Information and Computer Science, University ofCalifornia, Irvine, 1998.

[65] I. Rish and M. Singh. A tutorial on infrence and learning in Bayesiannetworks. IBM Watson Research Center, August 2000.

[66] S. H. Schwartz and D. Zager. Value-oriented network management. InNOMS’00 [53].

[67] C. Scott, P. Wolfe, and M. Erwin.Virtual private networks. O’Reilly,second edition, 1999.

[68] A. S. Sethi, F. Faure-Vincent, and Y. Raynaud, eds.Integrated NetworkManagement IV. Chapman and Hall, May 1995.

[69] R. Siamwalla, R. Sharma, and S. Keshav. Discovering Internet topol-ogy. Technical report, Cornell University, 1998.

[70] M. Sloman, S. Mazumdar, and E. Lupu, eds.Integrated Network Man-agement VI. IEEE Publishing, May 1999.

[71] S. Srinivas. Building diagnostic models from functional schematics.Technical Report KSL 95-15, Knowledge Systems Laboratory, Dept. ofComputer Science, Stanford University, Feb. 1994.

[72] M. Steinder and A. S. Sethi. Non-deterministic diagnosis of end-to-end service failures in a multi-layer communication system. InProc. ofICCCN, pp. 374–379, Scottsdale, AZ, 2001.

[73] M. Steinder and A. S. Sethi. Increasing robustness of fault localizationthrough analysis of lost, spurious, and positive symptoms. InProc. ofIEEE INFOCOM, New York, NY, 2002.

[74] M. Steinder and A.S. Sethi. End-to-end service failure diagnosis usingbelief networks. InProc. Network Operation and Management Sympo-sium, Florence, Italy, 2002.

[75] W. R. Stevens.TCP/IP Illustrated, vol. I. Addison Wesley, first edition,1995.

[76] Tivoli. Netview for Unix: Administrator’s Guide, Version 6.0, Jan. 2000.[77] C. Wang and M. Schwartz. Fault detection with multiple observers.

IEEE Transactions on Networking, 1(1):48–55, Feb. 1993.[78] C. Wang and M. Schwartz. Identification of faulty links in dynamic-

routed networks. Journal on Selected Areas in Communications,11(3):1449–1460, Dec. 1993.

[79] P. Wu, R. Bhatnagar, L. Epshtein, M. Bhandaru, and Z. Shi. Alarmcorrelation engine (ACE). In NOMS’98 [54], pp. 733–742.

[80] S. A. Yemini, S. Kliger, E. Mozes, Y. Yemini, and D. Ohsie. Highspeed and robust event correlation.IEEE Communications Magazine,34(5):82–90, 1996.

A BUCKET ELIMINATION ALGORITHM

Bucket elimination MPE [17]

initialize bucketsB1, . . ., B|V | for variablesV1, . . ., V|V |Backward phase:

for p = |V | downto1 doif (Vp = vp) ∈ e then

for eachhj(Vj1 , . . . , Vjm) ∈ Bp dolet jk be an index ofVp in the parameter list ofhj

let h′j(Vj1 , . . . , Vjk−1 , Vjk+1 , . . . , Vjm) =hj(Vj1 , . . . , Vjk−1 , vp, Vjk+1 , . . . , Vjm)

21

puth′j in the bucket of variableVjl ∈ var(hj)that has the highest number in orderingo

else letUp =⋃|Bp|

i=1var(hi), hi ∈ Bp

let ik be an index ofVp in the parameter list ofhi

for all Ap computehp(Ap) = maxvp

∏|Bp|i=1

hi(vAp

i1, . . . , v

Ap

ik−1, vp,

vAp

ik+1, . . . , v

Ap

im)

voptp (Ap) = argmaxVp

hp(Ap)Forward phase:

for p = 1 upto|V | doAoptp = Aopt

p−1 ∪ {voptp (Aopt

p−1)}

Bucket elimination works by creating a set of buckets, onefor every random variable in the belief network. Given or-dering o, the random variables are numbered consecutively.Initially, the bucket for variableVi, Bi, contains all func-tions hj(Vj1 , . . . , Vjm)=Pj1 , such that none of the variablesVj1 , . . . , Vjm is higher in orderingo thanVi andVi appears inthe parameter list ofhj ; Pj1 is anm-dimensional conditionalprobability matrix associated with nodeVj1 and random vari-ablesVj2 , . . . , Vjm

are all parents of random variableVj1 . Thebuckets are then eliminated starting from the last according toorderingo. Eliminating a bucket removes its correspondingvariable from all functions in this bucket. If the bucket’s cor-responding variable has been observed, then the eliminationof the bucket is performed by assigning the observed value ineachm-parameter function in the bucket and placing thus cre-ated(m−1)-parameter function in the bucket correspondingto its highest numbered variable. A bucket corresponding toan unobserved variable is eliminated by converting all its func-tions into one function using maximization as the eliminationoperator. The elimination performed in bucketBi computes,for all possible value assignments to variables mentioned inBi

excludingVi, the value of variableVi that maximizes the prod-uct of all functions inBi.

In the second phase, the algorithm proceeds from the lowestnumbered variable to the highest numbered one and collectsthe values of their most probable assignments. Initially, the as-signment contains only the optimal value forV1. In the i-thstep, the partial assignmentAopt

i−1=(v1, . . . , vi−1) is extendedwith the optimal value for variableVi computed in the back-ward phase for partial assignmentAopt

i−1.

The following notation is used in the formal presentation ofthe algorithm. (1) Ifhi belongs to bucketBp, var(hi) de-notes the set of all variables mentioned inhi excludingVp.(2) Ap is an assignment of values to variables inUp, where

Up={V1, . . . , Vp}. (3) ForVi∈Up, vAp

i is a value of variableVi in assignmentAp.

B APPROXIMATION FORALGORITHM 3

Let E(v1, . . . , vm)=(1−∏

Vj=1 qVjX)∏

k π∗X(vk). A combi-nation that maximizes the value of expressionE(v1, . . . , vm)must contain at least one assignmentVj=1. Otherwise,E(v1, . . . , vm) would be equal to zero. The approximationpresented in this paper is based on the fact that the obser-vation X=1 is more likely to have been caused by an ac-tivation of a single parent ofX, than by simultaneous acti-vations of two or more parents ofX. The calculation pre-sented below aims at finding the set of all parents ofX,

π1, that should be assigned to one in the combination thatbest explains the observationX=1. The best choices forVjs ∈ π1 are those parents ofX for which π∗X(Vj=0)=0,because all combinations in which suchVj=0 result inE(v1, . . . , vm)=0. If no Vjs exist such thatπ∗X(Vj=0)=0,then we pick oneVj for which cVjXπ∗X(1)/π∗X(0) is max-imum. In this expression,cVjXπ∗X(1) and π∗X(0) representan estimate ofVj ’s contribution toE(v1, . . . , vm) with vj=1and vj=0, respectively. ExpressioncVjXπ∗X(1)/π∗X(0) ap-

proximatesE(v1,...,vj−1,1,vj+1,...,vm)E(v1,...,vj−1,0,vj+1,...,vm) , i.e., the benefit resulting

from changingVj ’s value from 0 to 1 in the parameter listof E(v1, . . . , vm). Vj ’s for which neitherπ∗X(Vj=0)=0 norcVjXπ∗X(1)/π∗X(0) is maximum are assigned to one only iftheir π∗X(vj=0)≤π∗X(vj=1). This condition ensures thatVj

has bigger contribution to the value ofE(v1, . . . , vm) when itis assigned to 1 rather than 0, which is proven by the followinginequality:

E(v1, . . . , vj−1, 1, vj+1, . . . , vm)E(v1, . . . , vj−1, 0, vj+1, . . . , vm)

=

1− qVjX

∏Vk 6=Vj |vk=1 qVkX

1−∏

Vk 6=Vj |vk=1 qVkX· π∗X(vj = 1)π∗X(vj = 0)

≥ 1

Vjs that do not meet any of the conditions described above areassigned value 0.

Let π1={Vj |π∗X(vj=0)=0} if at least one suchVj exists. Oth-erwise,π1={VB}whereVB=argmaxVj

(cVjXπ∗X(1) /π∗X(0)).Let π0={Vj |π∗X(vj=1)=0}. The following equations sum-marize our approximation technique of the maximization forx = 1.

max{vk}

(P (x|v1, . . . , vm)

∏k

π∗X(vk))' (1− qX)πX

qX =∏

Vj∈π1

qVjX

∏Vj |π∗X(vj=0)≤π∗

X(vj=1)

qVjX

πX =∏

Vj∈π1

π∗X(1)∏

Vj∈π0

π∗X(0)∏

Vj /∈π0∪π1

max(π∗X(1), π∗X(0))

Thus, the complete expression for the maximization is as fol-lows:

max{vk}

(P (x|v1, . . . , vm)

∏k

π∗X(vk))'{

(1− qX)πX if x = 1∏Vk

max(qVkXπ∗X(1), π∗X(0))if x = 0

The above expression is then substituted instead of the maxi-mization in the computation ofbel∗(x) andπ∗Ui

(x), to compute

their approximationsbel∗(x) and π∗Ui

(x), respectively. Theapproximation forλ∗X(vj) follows the same reasoning withtwo modifications: (1) the maximization does not includeVj ,

which we address by replacingqX andπX with q(j)X andπ

(j)X

presented below, respectively; (2) forvj=1, Vj∈π1, whichmakes the search for otherπ1’s members unnecessary; there-fore we useq(j)

X1 presented below instead ofq(j)X . We approxi-

mateλ∗X(vj) as follows:

λ∗X(vj) =

22

max( ∏

i λ∗Ui(0)

∏k 6=j max(qVkXπ∗X(1), π∗X(0)),∏

i λ∗Ui(1) (1− q

(j)X )π(j)

X

)if vj = 0

max(qVjX

∏i λ∗Ui

(0)∏

k 6=j max(qVkXπ∗X(1), π∗X(0)),∏i λ∗Ui

(1) (1− qVjXq(j)X1)π

(j)X

)if vj = 1

In the expression forλ∗X(vj), q(j)X , q

(j)X1, andπ

(j)X are defined

using the following expressions:

q(j)X =

∏Vk 6=j∈π1

qVkX

∏Vk 6=j |π∗X(vk=0)≤π∗

X(vk=1)

qVkX

q(j)X1 =

∏Vk 6=j |π∗X(vk=0)≤π∗

X(vk=1)

qVkX

π(j)X =

∏Vk 6=j∈π1

π∗X(1)∏

Vk 6=j∈π0

π∗X(0)∏

Vk 6=j /∈π0∪π1

max(π∗X(1), π∗X(0))

The boundary conditions for the childless and parentless nodesare as follows. (1) An unobserved symptom node is assumedto receive messageλ∗U0

={1, 1}. (2) A symptom node ob-served as 1 or 0 receivesλ∗U0

={0, 1} or λ∗U0= {1, 0}, respec-

tively. (3) For a parentless nodeqX=0, πX=P (X=1), and∏k 6=j max(qVkXπ∗X(1), π∗X(0))=P (X=0).

23

non-deterministic fault localization in communication ...sethi/papers/03/tr2003-03.pdf · using...

Documents