

Sherlock is Around: Detecting Network Failures with Local Evidence Fusion

Qiang Ma¹, Kebin Liu², Xin Miao¹, Yunhao Liu¹,²
¹Department of Computer Science and Engineering, Hong Kong University of Science and Technology

²MOE Key Lab for Information System Security, School of Software, Tsinghua National Lab for Information Science and Technology, Tsinghua University

{maq, kebin, miao, liu}@greenorbs.org

Abstract—Traditional approaches for wireless sensor network diagnosis are mainly sink-based. They actively collect global evidence from sensor nodes to the sink so as to conduct centralized analysis at the powerful back-end. On the one hand, long-distance proactive information retrieval incurs huge transmission overhead; on the other hand, due to the coupling effect between the diagnosis component and the application itself, the sink often fails to obtain complete and precise evidence from the network, especially from the problematic or critical parts. To avoid the large overhead of the evidence collection process, self-diagnosis injects fault inference modules into sensor nodes and lets them make local decisions. Diagnosis results from single nodes, however, are generally inaccurate due to their narrow view of system performance. Besides, existing self-diagnosis methods usually lead to inconsistent results from different inference processes. How to balance the workload among the sensor nodes in a diagnosis task is a critical issue. In this work, we present a new in-network diagnosis approach named Local-Diagnosis (LD2), which conducts the diagnosis process within a local area. LD2 reaches a diagnosis decision through distributed evidence fusion operations: each sensor node provides its own judgement, and the evidence is fused within a local area based on Dempster-Shafer theory, resulting in a consensus diagnosis report. We implement LD2 on TinyOS 2.1 and examine its performance on a 50-node indoor testbed.

I. INTRODUCTION

Wireless sensor networks (WSNs) have been widely used in many critical application domains such as environment monitoring, infrastructure protection, and habitat tracing [8], [23]. These systems often need to sustain operation for years and to work reliably under real-world communication conditions. Sensor nodes, however, are error-prone and subject to component faults, performance degradation, and even major system failures in real-world deployments [22], [14], [17], [20], [12], [19]. Thus, accurate and real-time fault diagnosis plays a very important role in WSN system operation. Compared to conventional networks, it is more challenging to explore the root causes in WSNs when abnormal symptoms are observed. First, due to the negative impact of noisy environments and the ad-hoc nature of WSNs, it is difficult for developers to delve deeply into the in-network behaviors among the sensor nodes, especially in large-scale networks where the forwarding infrastructure changes dynamically as topology or sensor activity varies [10]. Second, sensor nodes have limited power and computing capability to carry out advanced network diagnosis programs.

Fig. 1. A part of CitySee deployment

Third, the existence of a large variety of WSN-specific protocols further exacerbates the debugging and diagnosis problems.

This work is motivated by our ongoing urban carbon dioxide sensing project, CitySee. CitySee carries out several applications such as monitoring carbon emissions and estimating atmospheric concentrations of carbon dioxide. Figure 1 shows a part of the CitySee system with 494 sensor nodes (CitySee deploys 1196 nodes in total). During the operation period, we observe frequent abnormal symptoms in the network, such as high data loss, temporary disconnection of nodes in a certain region, and the like. To troubleshoot the root causes of these symptoms, we have applied different failure detection approaches. Existing approaches are generally sink-based: they retrieve detailed metrics from managed nodes such as remaining energy, MAC-layer backoff, neighbor table, routing table, etc. The network administrators then conduct comprehensive analysis at the back-end.

In practice, however, it is difficult to apply sink-based approaches in large-scale WSNs. On the one hand, a large-scale WSN usually consists of hundreds or thousands of sensor nodes, such that most of the nodes have to forward their packets to the sink through many hops. Besides, most existing sink-based approaches need to collect more diagnosis data than application data.


Therefore, proactive information retrieval in a large-scale network generally incurs huge transmission overhead. On the other hand, due to the unreliable nature of wireless communications, the sink often fails to obtain complete evidence from the network. In addition, it is even more difficult to retrieve the expected information from a problematic region.

Instead of back-end analysis, self-diagnosis approaches inject fault inference modules into sensor nodes and let them make local decisions. These approaches, however, may suffer from the narrow scope of a single node. Some self-diagnosis methods like TinyD2 propose to let fault detectors travel among different sensors [11]. Such a method, however, does not handle the diverse judgements from different nodes well, so it leads to inconsistent diagnosis reports in different network regions. Since a single node only has limited computation and energy resources, we have to balance the workload among nodes in an in-network diagnosis process while guaranteeing that an integrated judgement can be achieved.

To address these issues, we present a new in-network diagnosis approach called Local-Diagnosis (LD2). Instead of retrieving the system state information to the sink, LD2 carries out the diagnosis process in a local area. We first select some sensor nodes in a local area where abnormal symptoms are observed. We then conduct a process of evidence fusion to explore the most probable root causes. Eventually a diagnosis report is generated and sent to the sink. Compared with existing approaches, LD2 achieves very high energy efficiency. In CitySee, we observe that almost all root causes can be at least partly reflected by the sensor nodes within a neighborhood, while each single sensor node has very limited knowledge to complete a comprehensive diagnosis process. Therefore, LD2 not only avoids large transmission overhead and information loss on the way to the sink, but also achieves high diagnosis accuracy. Moreover, it provides in-time diagnosis results, since it utilizes first-hand evidence without the delay of a collection process.

In order to deal with the narrow scope of single nodes and inconsistent judgements from different nodes, we train a Naive Bayesian Classifier (NBC) which encodes the probabilistic correlation between a set of state attributes and root causes. We insert this Bayesian classifier into the sensor nodes to compute the posterior probability distribution of possible root causes according to their states. That is, each node translates its local evidence (i.e., state attributes) into our communication model, which is based on Dempster-Shafer theory (D-S theory). Using D-S theory, we design a novel strategy to fuse the judgements from all the sensor nodes within a neighborhood in a flexible and load-balanced manner. Once a node has detected some abnormal symptoms, it constructs a tree in a local area to strictly dictate the fusion order, so as to ensure diagnosis coverage and make the fusion process a long string of sequential operations. The advantages of this serial manner are two-fold. First, instead of collecting all the information at one certain node, we divide the work of evidence fusion into many steps, so that each node can summarize the evidence of its child nodes. That is, we let all the nodes in a local area share the responsibility of diagnosis. Second, the contribution of each node converges at the root node of the tree, so that we can easily ensure a local consensus on the final diagnosis result. Moreover, our diagnosis result is consistent regardless of the fusion order.

The rest of this paper is organized as follows. Section II summarizes the related work. Section III describes the system framework. Section IV presents our design and provides additional techniques to deal with several practical implementation issues. Section V shows performance evaluation results from real indoor testbed experiments. Section VI concludes the work.

II. RELATED WORK

Most existing approaches for sensor network diagnosis are sink-based, in which each sensor reports its state information to the sink by periodically transmitting specific control messages. Sympathy [17] and EmStar [4] rely heavily on an add-in protocol that delivers a large amount of status information from individual sensor nodes to the sink, introducing high overhead to resource-constrained sensor networks. In order to minimize the overhead, some researchers propose to establish inference models by marking the data packets [13], [14], and then parse the results at the sink to infer the network status. Clairvoyant [25] is a notable tool which focuses on debugging sensor nodes at the source level, enabling developers to wirelessly connect to remote sensors and execute debugging commands. Declarative TracePoints [2] allows developers to insert a group of action-associated checkpoints at runtime. FIND [5] detects faulty nodes by ranking sensor readings collected during natural event detection while the network performs its routine tasks. Agnostic Diagnosis [15] discovers silent failures by tracking the changes and anomalies of correlation graphs between system metrics. MintRoute [21] visualizes the network topology by collecting neighbor tables from sensor nodes. LiveNet [3] provides a set of tools and techniques for reconstructing the complex dynamics of live sensor networks. In [1], the authors use monitoring paths and cycles to localize single-link and Shared Risk Link Group failures. Self-diagnosis [11] plants a finite state machine into each sensor node, enabling it to change its diagnosis state accordingly. Nevertheless, it proves difficult to achieve consensus between the nodes, as each node can change its state whenever it receives a diagnosis request.

Data fusion techniques are also related to this work. [7] outlines the ideas of D-S theory and presents the basic D-S fusion equation. It also notes that D-S theory becomes unusable when the evidences significantly conflict with each other. [18], [24], [9] discuss how to change the combination rules to eliminate the impact of conflicting evidence, while [6], [16] claim that a better way is to modify the evidence itself, e.g., adding a basic confidence that describes the importance of each evidence.


Fig. 2. The system framework of LD2 (modules shown: State Attributes, NBC, Basic Probability Assignment, Evidence Filter, Diagnosis Trigger, Fusion Tree, Evidence Fusion, Diagnosis Report; the back-end exchanges diagnosis requests and diagnosis reports with the local-diagnosis area)

III. SYSTEM FRAMEWORK

In this section, we introduce the general idea of our approach and give an overview of the system framework. As illustrated in Fig. 2, our main idea is to conduct the diagnosis process in a local area where abnormal symptoms are observed. Unlike traditional sink-based approaches, in-network diagnosis faces the following challenges. First, the diagnosis process works in a distributed manner, which means that all nodes in the neighborhood are required to be involved in the diagnosis task. To make the nodes able to cooperate with each other, a feasible communication model for evidence delivery among the nodes must be carefully designed. The most naive design is to select a cluster head which collects all the state attributes from local nodes; some lightweight diagnosis algorithm could then be leveraged to find the root causes. Nevertheless, this approach proves infeasible, as we can hardly find a diagnosis algorithm that can be planted in a resource-limited sensor node and still perform with good accuracy. Second, as mentioned above, to avoid putting the entire diagnosis workload on one certain sensor, the diagnosis process should be dividable, so that each node joins in the diagnosis work rather than just delivering its local information to others. Third, for one sensor, the original local information only includes state attributes such as energy, MAC-layer backoff, neighbor table, routing table, etc. There must be algorithms for the nodes to refine this raw data, so as to generate suitable inputs for the fusion work.

The data flow of LD2 is illustrated in Fig. 2. As we can see, the basic module applies a Naive Bayesian model which encodes the probabilistic correlation between a set of state attributes and root causes. After this translation, we obtain a basic probability assignment over the root causes. The diagnosis trigger module determines whether abnormal symptoms exist; if so, it starts a new diagnosis process. For example, if some local state values experience abnormal changes, or some special event occurs, such as a node having been absent from the neighbor table for a long time, the trigger module launches a new diagnosis process to check whether this neighbor has crashed.

At the very beginning of a diagnosis process, a fusion tree rooted at this sensor node is established. Each node within the diagnosis area is involved in this tree. The diagnosis process is actually divided into several evidence fusion operations. Following the fusion tree, the diagnosis process is partly executed by the intermediate nodes. The evidence fusion module is based on D-S theory, which is also known as the theory of belief functions. D-S theory is a generalization of the Bayesian theory of subjective probability. Considering the unique characteristics of WSNs, we design an improved approach that assigns a basic confidence to each evidence according to its significance for the fusion system.

IV. MAIN DESIGN

In this design, we consider both diagnosis efficiency and diagnosis accuracy. On the one hand, in order to avoid long-distance evidence retrieval, which leads to large communication overhead, we carry out the diagnosis process locally rather than at the back-end. On the other hand, to improve diagnosis accuracy, we need to take the judgements from all nodes within a local area into consideration, as a single node only has limited scope and computation ability. To make the nodes cooperate with each other to complete the diagnosis work, a series of interfaces and protocols needs to be designed, including in-node data processing and the in-network communication model. In the following subsections, we show the design details of LD2.

A. Overview

Sink-based approaches retrieve state information from the network so as to conduct centralized analysis. They assume that a big picture of the network greatly helps diagnosis, as many evidences can be leveraged to validate the diagnosis result. In CitySee, however, we observe that almost all root causes can be at least partly reflected by the sensor nodes within a neighborhood. For example, if a sensor node has crashed, its neighbors must notice that it has stopped sending beacon messages for neighbor discovery. When a route loop occurs, by checking the network-layer sequence number (e.g., the CTP sequence number), the nodes in the loop can realize that some packets are repeatedly forwarded. Diagnosing the network locally achieves real-time diagnosis and avoids information loss on the collection path to the sink. At the same time, compared to a single node, LD2 integrates more evidence to validate the result and thus achieves higher accuracy.
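As an illustration of the loop symptom above, the following minimal sketch flags packets whose (origin, sequence number) pair repeats within a short window, which is the local evidence a node inside a routing loop would observe. This is our own illustration, not the authors' TinyOS code; the class name and window size are assumptions:

```python
from collections import deque

class LoopDetector:
    """Hypothetical sketch: flag repeated (origin, seqno) pairs,
    the local symptom a node inside a routing loop would see."""

    def __init__(self, window=64):
        self.recent = deque(maxlen=window)   # remember recent packets only

    def on_forward(self, origin, seqno):
        """Return True if this packet was already forwarded recently."""
        key = (origin, seqno)
        duplicate = key in self.recent
        self.recent.append(key)
        return duplicate

# Usage: a duplicate within the window is evidence of a possible route loop.
det = LoopDetector()
assert det.on_forward(origin=7, seqno=42) is False   # first time seen
assert det.on_forward(origin=7, seqno=42) is True    # repeated -> loop symptom
```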


B. Naive Bayesian Classifier

Considering the resource limitations of sensor nodes, fault detection within one single node should be achieved in an energy-efficient way, and many existing models can be applied. For example, we can leverage simple rule-based models to make decisions based on the local evidence. Lightweight probabilistic classifiers like the Naive Bayesian Classifier can also be applied. We use a binary variable R to denote each type of root cause, and P(R) and P(¬R) are the probabilities that this failure occurs or not. Then, in a diagnosis process, a sensor node is able to calculate the posterior probability of R given its local evidence, according to the Naive Bayesian model:

$$P(R \mid F_1, F_2, \ldots, F_n) = \frac{1}{P(F_1, F_2, \ldots, F_n)}\, P(R) \prod_{i=1}^{n} P(F_i \mid R)$$

where $P(F_1, F_2, \ldots, F_n)$ is a scaling factor which depends only on the evidence $(F_1, F_2, \ldots, F_n)$, and $(F_1, F_2, \ldots, F_n)$ denotes the metrics of the current sensor node as well as its neighbors. During the training stage, we estimate the values of $P(R)$, $P(\neg R)$, and $P(F_i \mid R)$; these parameter values can be learned from historical data. The storage cost of the Naive Bayesian classifier is $(1 + 2nr)$ parameters for each failure type, where $n$ denotes the number of metrics and each metric has $r$ discrete values (considering the computation capability of sensor nodes, we discretize the continuous metrics in this work to simplify the probability computation).

C. Evidence Fusion

How to make multiple nodes within a local area cooperate with each other to detect network failures is non-trivial. The main challenges are three-fold. First, the communication for evidence transfer must use the wireless channel itself; that is, we have no out-of-band channel for diagnosis. If we incur a large amount of transmission overhead during the evidence fusion, new network failures may be introduced, a phenomenon also known as a Heisenbug. Second, complicated fusion algorithms are not applicable, as a sensor node is resource-limited. For the same reason, the algorithm should be dividable, to avoid placing too much data at one node. Third, the algorithm must ensure a local consensus on the final diagnosis report. Moreover, to achieve real-time diagnosis, the period of the diagnosis process must be short.

1) Improved Dempster-Shafer Theory: D-S theory is a generalization of the Bayesian theory of subjective probability. It is based on two ideas: obtaining degrees of belief for one question from subjective probabilities for a related question, and Dempster's rule for combining such degrees of belief when they are based on independent items of evidence.

Suppose $m_1$ and $m_2$ are two basic probability assignments (i.e., mass functions) over the frame of discernment $W$. Intuitively, $m_i(U)$ describes the extent to which the evidence supports $U$, where $U \in 2^W$, $i = 1, 2$. The fusion formula of D-S theory is:

$$m_{12}(X_i) = \begin{cases} \dfrac{\sum_{A_j \cap B_k = X_i} m_1(A_j)\, m_2(B_k)}{1 - \sum_{A_j \cap B_k = \emptyset} m_1(A_j)\, m_2(B_k)} & \text{if } X_i \neq \emptyset \\[2ex] 0 & \text{if } X_i = \emptyset \end{cases}$$

where $X_i, A_j, B_k \in 2^W$. The quantity $k_{12} = \sum_{A_j \cap B_k = \emptyset} m_1(A_j)\, m_2(B_k)$ is called the conflict factor of the two evidences $m_1$ and $m_2$. Notably, two evidences $m_1$ and $m_2$ can be totally conflicting, such that $k_{12} = 1$; therefore mass functions are not always combinable. What is more, even if they are not totally but still highly conflicting, i.e., $k_{12} \to 1$, the combination result often goes against practical sense. Many works have been proposed to address this issue. In general, they can be classified into two types: one modifies the combination rules, while the other improves the evidence models.

The methods that modify the combination rules consider two cases, depending on whether the evidences are reliable or unreliable. Nevertheless, they both mainly consider how to reassign the conflicting mass, e.g., how to decide the ratio between the event possibilities when conflict happens. In [18], the authors propose that, assuming reliable evidences, the main reason for conflict is incompleteness of the frame of discernment, i.e., some unknown event possibilities exist. [24], [9] believe that not all evidences are reliable, and propose that the conflicting part between the evidences should be discarded or reassigned to the other possibilities. In practice, when a large number of evidences need to join the fusion task, we hope that the evidences can be grouped by some metric, so that we can conduct the fusion regardless of the fusion order and thus reduce the computation work. Unfortunately, the improved methods above all fail to satisfy the associative law. This work takes the unique characteristics of WSNs into account and designs the following improved D-S theory for LD2.

Suppose the frame of discernment in our evidence model is $W = \{R_0, R_1, \ldots, R_n\}$. $W$ consists of the different root causes $\{R_1, R_2, \ldots, R_n\}$ in the network, plus a basic event "no problem" $R_0$, which indicates that no definite diagnosis result is produced. We let each node $N_i$ generate a possibility value $m_i(R_j)$ only for each single root cause $R_j$, according to its own local information. That is, $m_i(U) = 0$ for any $U \in 2^W$ with $|U| > 1$, and $\sum_{0 \le j \le n} m_i(R_j) = 1$.

Definition 1. The distance between $m_1$ and $m_2$ is:

$$d(m_1, m_2) = \sqrt{\tfrac{1}{2}\,(M_1 - M_2)^T (M_1 - M_2)}$$

where $M_i = [m_i(R_0), m_i(R_1), \ldots, m_i(R_n)]^T$, $i = 1, 2$. We also have $0 \le d(m_1, m_2) \le 1$:

$$\begin{aligned} d(m_1, m_2) &= \sqrt{\tfrac{1}{2} \sum_{0 \le j \le n} \big(m_1(R_j) - m_2(R_j)\big)^2} \\ &= \sqrt{\tfrac{1}{2} \sum_{0 \le j \le n} \big(m_1^2(R_j) + m_2^2(R_j) - 2\, m_1(R_j)\, m_2(R_j)\big)} \\ &\le \sqrt{\tfrac{1}{2} \sum_{0 \le j \le n} \big(m_1^2(R_j) + m_2^2(R_j)\big)} \\ &\le \sqrt{\tfrac{1}{2} \Big[\big(\textstyle\sum_{0 \le j \le n} m_1(R_j)\big)^2 + \big(\textstyle\sum_{0 \le j \le n} m_2(R_j)\big)^2\Big]} = 1 \end{aligned}$$

Definition 2. The similarity degree of $m_1$ and $m_2$ is:

$$s(m_1, m_2) = 1 - d(m_1, m_2)$$

As we can see, the greater the similarity degree of $m_1$ and $m_2$, the more similar the analyses the two evidences describe. If one evidence is similar to all the others, then we believe that this evidence is important. Suppose we have $N$ evidences $e_1, e_2, \ldots, e_N$, and their corresponding basic probability assignments are $m_1, m_2, \ldots, m_N$.

Definition 3. The basic confidence of $e_i$ ($i = 1, 2, \ldots, N$) is:

$$\beta_i = \sum_{1 \le j \le N,\, j \ne i} s(m_i, m_j)$$

To avoid a huge computation cost, we can also randomly sample some evidences to compose a standard set $S$, so that each evidence $m_i$ computes its total similarity degree to $S$ as its basic confidence: $\beta_i = \sum_{s_j \in S,\, m_i \ne s_j} s(m_i, s_j)$. After normalization, we get the importance of $m_i$ relative to the evidence with the greatest basic confidence: $\psi_i = \beta_i / \max_{1 \le j \le N} \beta_j$. The normalization can be omitted when the fusion task is divided into small subtasks and the global maximum is unknown, i.e., $\psi_i = \beta_i$. We then transfer the basic probability assignments by multiplying by the basic confidence, making them of equal importance in the new fusion system:

$$m_i'(R_j) = \psi_i\, m_i(R_j), \quad \forall\, 1 \le j \le n$$
$$m_i'(R_0) = \psi_i\, m_i(R_0) + (1 - \psi_i)$$

Notably, if only two evidences are involved in the fusion task, then because $s(m_i, m_j) = s(m_j, m_i)$, they have the same basic confidence even if one of them provides inaccurate evidence. To address this issue, we need to set a threshold $F_t$ such that Definition 3 is used only when more than $F_t$ evidences join the fusion. In our implementation, we set $F_t = 4$.
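Putting Definitions 1 through 3 together, the following sketch shows one plausible reading of the discounting step. It is our own illustration under the assumptions stated in the comments; the threshold value and the vector layout follow the text above, everything else is invented:

```python
import math

def distance(m1, m2):
    """Definition 1: d(m1, m2) over mass vectors [m(R0), m(R1), ..., m(Rn)]."""
    return math.sqrt(0.5 * sum((a - b) ** 2 for a, b in zip(m1, m2)))

def similarity(m1, m2):
    """Definition 2: s = 1 - d."""
    return 1.0 - distance(m1, m2)

def transfer(evidences, standard_set=None, F_t=4):
    """Definition 3 + BPA transfer: discount each evidence by its basic
    confidence; the residual mass goes to R0 ('no problem', index 0)."""
    if len(evidences) <= F_t:
        return evidences                       # too few evidences to rate them
    ref = standard_set or evidences            # sampled standard set, if any
    betas = [sum(similarity(m, s) for s in ref if s is not m)
             for m in evidences]
    beta_max = max(betas)
    out = []
    for m, beta in zip(evidences, betas):
        psi = beta / beta_max                  # relative importance
        m2 = [psi * x for x in m]
        m2[0] += 1.0 - psi                     # push the discount onto R0
        out.append(m2)
    return out

# Five evidences over {R0, R1, R2}; the last one disagrees with the rest.
ev = [[0.1, 0.8, 0.1]] * 4 + [[0.1, 0.1, 0.8]]
for m in transfer([list(e) for e in ev]):
    print([round(x, 3) for x in m])            # the outlier is discounted toward R0
```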

In this new fusion system, for $R_i$ ($i = 1, 2, \ldots, n$) we reduce the impact of the evidences with less importance; that is, confirming whether $R_i$ happens mainly relies on the other evidences. On the contrary, for $R_0$ we increase the impact of the evidences with less importance, so as to spread their confidence over the other root causes. After transferring the basic probability assignments, all the evidences are of equal importance, and we can then utilize D-S theory to conduct evidence fusion. Our improved D-S theory is designed for LD2's evidence model: we choose not to change the combination rules but to refine the evidences. It also guarantees that the fusion result stays the same even if we change the fusion order, i.e., it satisfies the associative law. This is very important for our design, as we cannot ensure that the fusion tree always has the same structure (see the Fusion Algorithm subsection).

Theorem 1. $m'_{(12)3}(X_i) = m'_{1(23)}(X_i)$.

PROOF.

$$m'_{(12)3}(X_i) = \frac{\sum_{A_j \cap B_k = X_i} m'_{12}(A_j)\, m'_3(B_k)}{1 - \sum_{A_j \cap B_k = \emptyset} m'_{12}(A_j)\, m'_3(B_k)}$$

Substituting $m'_{12}(A_j) = \dfrac{\sum_{C_l \cap D_t = A_j} m'_1(C_l)\, m'_2(D_t)}{1 - \sum_{C_l \cap D_t = \emptyset} m'_1(C_l)\, m'_2(D_t)}$ into both the numerator and the denominator, the common factor $1 - \sum_{C_l \cap D_t = \emptyset} m'_1(C_l)\, m'_2(D_t)$ cancels out:

$$m'_{(12)3}(X_i) = \frac{\sum_{A_j \cap B_k = X_i} \Big(\sum_{C_l \cap D_t = A_j} m'_1(C_l)\, m'_2(D_t)\Big)\, m'_3(B_k)}{\sum_{A_j \cap B_k \neq \emptyset} \Big(\sum_{C_l \cap D_t = A_j} m'_1(C_l)\, m'_2(D_t)\Big)\, m'_3(B_k)} = \frac{\sum_{C_l \cap D_t \cap B_k = X_i} m'_1(C_l)\, m'_2(D_t)\, m'_3(B_k)}{\sum_{C_l \cap D_t \cap B_k \neq \emptyset} m'_1(C_l)\, m'_2(D_t)\, m'_3(B_k)}$$

Similarly, we can prove that:

$$m'_{1(23)}(X_i) = \frac{\sum_{C_l \cap D_t \cap B_k = X_i} m'_1(C_l)\, m'_2(D_t)\, m'_3(B_k)}{\sum_{C_l \cap D_t \cap B_k \neq \emptyset} m'_1(C_l)\, m'_2(D_t)\, m'_3(B_k)} = m'_{(12)3}(X_i). \qquad \square$$
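As a quick numeric sanity check of this order-independence, the snippet below reuses the `ds_combine` sketch defined earlier. Note that in LD2 the confidence transfer is applied once per evidence before any combination, so what the check exercises is the associativity of the underlying combination rule; the mass values are our own test data, with singleton-only masses as in LD2's evidence model:

```python
# Singleton-only masses over {R0, R1, R2}, as in LD2's evidence model.
def singletons(p0, p1, p2):
    return {frozenset({"R0"}): p0, frozenset({"R1"}): p1, frozenset({"R2"}): p2}

e1 = singletons(0.2, 0.7, 0.1)
e2 = singletons(0.3, 0.5, 0.2)
e3 = singletons(0.1, 0.6, 0.3)

left  = ds_combine(ds_combine(e1, e2), e3)     # (e1 + e2) + e3
right = ds_combine(e1, ds_combine(e2, e3))     # e1 + (e2 + e3)
for key in left:
    assert abs(left[key] - right[key]) < 1e-9  # fusion order does not matter
```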

2) Fusion Algorithm: In this part we present our specific algorithms for evidence fusion. First we introduce the fusion tree. Every diagnosis process is rooted at a node which detects abnormal network symptoms such as node crash, traffic contention, or route loop. It triggers a diagnosis model, determines the diagnosis area, and then broadcasts diagnosis request beacons (DREQ) to establish the fusion tree. Basically, we need to determine the diagnosis area for each symptom. For example, finding out whether a node has crashed only requires the evidence of its one-hop neighbors, while finding out whether a route loop exists requires visiting all the nodes transmitting the relevant packets. All this information must be included in the DREQ so that the nodes know the details of the diagnosis task. When other nodes receive a DREQ, they join the fusion tree if they are located in the diagnosis area. Once they join the fusion tree, they keep broadcasting DREQ to inform other related nodes. In the process of constructing the fusion tree, each node records its parent node and child nodes for the subsequent evidence collection.

Notably, in the process of establishing the fusion tree, the root also needs to sample a standard set for the subsequent evidence fusion. As mentioned above, utilizing a standard set can reduce the computation cost. Besides, in our fusion system it greatly reduces the transmission overhead, as we have no need to collect all the evidences in order to assign a global basic confidence. In LD2, we make the standard set consist of the evidences from the root node and its direct child nodes. Every DREQ packet contains the standard set, so that each node in the fusion tree is able to calculate its own basic confidence.

The establishment of the fusion tree finishes when no more DREQs are transmitted. What follows is the evidence fusion process. First, each leaf node sends out a leaf-query beacon (LQUE) to make sure that it indeed has no child node in the fusion tree; in effect, the LQUE makes up for lost DREQs.


Algorithm 1 Fusion Tree Establishment Algorithm
1: Denote the node ID as id.
2: Initiate a null value parentID to record the parent ID.
3: Initiate an empty set Schildren to record the child nodes.
4: if Trigger component detects abnormal symptoms then
5:   Trigger a local-diagnosis process.
6:   Determine the diagnosis area.
7:   Sample the standard set for the fusion system.
8:   Broadcast request beacon (DREQ).
9: end if
10: if Receive DREQ then
11:   if Already in the tree then
12:     Check the parent field of this DREQ, denoted as p.
13:     if p = id then
14:       Add the source node of this DREQ into Schildren.
15:     end if
16:   else if In the diagnosis area then
17:     Assign the source ID of this DREQ to parentID.
18:     Update the parent field value with parentID.
19:     Broadcast DREQ.
20:   end if
21: end if

If there is no reply to the LQUE, the leaf transfers its local evidence (DEVI) to its parent node in the tree; otherwise, it updates its child set and waits for the evidence from the child nodes. The intermediate nodes must collect all the evidence from their child nodes and finally send the fusion result to their parent node. To avoid evidence loss, each intermediate node can "remind" its child nodes by broadcasting child-query beacons (CQUE), so that lost evidence can be retransmitted.

As we can see, the structure of the fusion tree is highly dynamic, as we connect the nodes by broadcasting DREQ. What is more, the fusion order strictly follows the fusion tree from the leaf nodes to the root node. Fortunately, Theorem 1 proves that the fusion result of LD2 stays consistent regardless of the fusion order. This property helps our fusion system greatly reduce the maintenance overhead of the fusion tree, and also allows it to ignore the impact of network topology.

V. EVALUATION

We evaluate LD2 on a real indoor testbed consisting of 50 TelosB motes. Two metrics are mainly used for evaluating LD2's accuracy: the false negative rate (i.e., miss detection rate) and the false positive rate (i.e., false alarm rate). The false negative rate is defined as the proportion of faulty cases which are detected as normal, while the false positive rate is defined as the proportion of normal cases which are detected as faulty.

Basically, we implement a CTP application in the network for analyzing the impact of the different diagnosis approaches. In this work, we implement two modules for diagnosing the network: LD2 and TinyD2. TinyD2 embodies the concept of self-diagnosis, which encourages each sensor node to run an embedded finite state machine to find out the root cause.

Algorithm 2 Evidence Fusion Algorithm
1: if Schildren is empty (leaf node) then
2:   Broadcast leaf-query beacon (LQUE) to ensure that it is a leaf node.
3:   if No reply to LQUE then
4:     Transmit local evidence (DEVI) to its parent.
5:   else
6:     Update the child set Schildren.
7:   end if
8: else
9:   Maintain an evidence set Sevidence to record the received evidences from the child nodes.
10:   Add local evidence into Sevidence.
11:   while Receive DEVI do
12:     Add this DEVI into Sevidence.
13:   end while
14:   for each Childi in Schildren do
15:     if Sevidence does not contain the DEVI from Childi then
16:       Transmit child-query beacon (CQUE) to Childi.
17:     end if
18:   end for
19:   if Has collected all the evidences from child nodes then
20:     Evidence fusion.
21:     Transmit Sevidence to parent node.
22:   end if
23: end if
24: if Receive LQUE then
25:   Check the source ID of this LQUE, denoted as p.
26:   if p = parentID then
27:     Reply to this LQUE.
28:   end if
29: end if
30: if Receive CQUE then
31:   Transmit local DEVI.
32: end if

During the tests, we manually inject three types of failures: node crash, traffic contention, and route loop. For each failure type, we conduct 60 cases. We also change the power level to evaluate the performance of the two approaches under different diagnosis densities (i.e., numbers of neighbor nodes).

A. Time Cost

Figure 3(a) illustrates the network topology of our testbed consisting of 50 motes. First we discuss the time cost of the diagnosis process. Generally, we divide the cost into two parts: fusion tree establishment and the evidence fusion process. As mentioned in Section IV, the fusion tree depends on the diagnosis area, which is determined by the symptoms. In the experiments, we give the three network failures above the same diagnosis area. Node 25 (i.e., the red mote) has 16 neighbors (i.e., the green motes). When node 25 crashes, or traffic contention occurs at node 25, the diagnosis area involves all its neighbors.


Fig. 3. Testbed topology. We inject three failures in (a), making their diagnosis areas the same: node 25 (red mote) crashes; traffic contention occurs at node 25; a route loop (blue arrows) exists among node 25's neighbors (green motes). We let node 13 (the blue mote in (b)-(e)) trigger the diagnosis process. The tree structures of (b), (c), (d), and (e) occur in 36%, 31%, 18%, and 9% of all cases, respectively.

Besides, we let a routing loop exist among all these neighbors (i.e., the blue arrows). That is, the diagnosis area of each failure is the set of node 25's neighbors. We also make node 13 the root node of the fusion tree.

Figures 3(b), 3(c), 3(d), and 3(e) describe the four most frequent structures observed when establishing the fusion tree. Figure 4 shows the time cost of sampling evidences and establishing the fusion tree. As mentioned in Section IV, the process of sampling evidences is used to assign a local basic confidence to each node in the evidence fusion, while establishing the fusion tree mainly consists of broadcasting and receiving beacons. As we can see, the time cost is stable across all the tree structures, i.e., about 19 ms for sampling evidences and 39 ms for establishing the fusion tree. Figure 5 shows the CDF of the time cost of evidence fusion in the three diagnosis processes. In 80% of the node-crash cases, LD2 finishes evidence fusion in a 16-node area within 95 ms. For traffic contention, it costs more than 133 ms in 60% of the cases, as the DEVI packet contains 3 possible root causes (i.e., ingress overflow, egress overflow, bad link) and thus more combination work is needed. Figure 6 depicts the CDF of the total time cost for diagnosing node crash, traffic contention, and route loop respectively. We observe similar trends in the two CDF figures, as the evidence fusion process accounts for most of the time in LD2.

Fig. 4. Time cost of sampling evidences and establishing fusion tree

Fig. 5. CDF of evidence fusion's time cost

Fig. 6. CDF of total time cost

B. Diagnosis Accuracy

Figures 7 to 12 illustrate the diagnosis results of detecting node crash, traffic contention, and route loop with LD2 and TinyD2 respectively. According to Fig. 7, LD2 is able to troubleshoot more than 92% of crashed nodes, and the false negative rate decreases as the number of neighbors increases. This is well understood: once a node has crashed, its neighbors find that it has been absent from their neighbor tables for a long period; therefore, the more neighbors, the more conclusive the diagnosis. As shown in Fig. 10, the false positive rate of LD2 is around 12% over varying diagnosis densities. For detecting route loops, each node produces its evidence by checking the CTP sequence number. As illustrated in Figs. 9 and 12, LD2 indeed maintains low false negative and false positive rates, i.e., 5% and 6%, which means that LD2 can successfully explore about 95% of route loops. By contrast, TinyD2 performs unstably in detecting crashed nodes and route loops under different diagnosis densities. When the density increases, TinyD2 often fails to achieve a consensus among the nodes and thus can hardly determine a root cause.

According to Fig. 8, LD2 correctly explores about 86% of traffic contention cases, while TinyD2 is able to find 78% of the cases when the number of neighbors is 16. Traffic contention occurs for several reasons, such as egress overflow, ingress overflow, and bad links. It proves difficult for TinyD2's finite state machine to reach an accept state. As we can see in Fig. 11, TinyD2's false positive rate increases to 22% when the number of neighbors is 16, while LD2 maintains around 16% under different diagnosis densities.

Fig. 7. False negative rate for node crash

Fig. 8. False negative rate for traffic contention

Fig. 9. False negative rate for route loop

Fig. 10. False positive rate for node crash

Fig. 11. False positive rate for traffic contention

Fig. 12. False positive rate for route loop

Fig. 13. Packet collection in LD2


C. Coupling Effect with Application

Finally, we discuss the coupling effect between the application and network diagnosis. We observe that most sink-based approaches need to retrieve more network information than the application itself generates. This is problematic because some network failures, such as traffic contention and bad routing, can occur precisely due to frequent large-volume collection.

Fig. 14. Packet collection in TinyD2

TinyD2 reduces the transmission overhead by broadcasting fault detectors over the air. In practice, however, it lacks a specific order to control the diagnosis and ensure a consensus result. What is more, once it cannot reach the accept state, many extra transmissions are required. Figures 13 and 14 depict the packet collection of the nodes in the diagnosis area; each dot indicates an application packet collected by the sink.

Fig. 15. Beacon transmissions in diagnosis process

As we can see, when we utilize TinyD2 to conduct the diagnosis process in the local area around timeline 30000 ms, 40000 ms, and 50000 ms, most application packets are lost. To find the root cause, we also sniff the beacons in this area. As illustrated in Fig. 15, during the diagnosis process every node in TinyD2 generates about 28 beacons within 200 ms, which probably causes local traffic contention. By contrast, the root node and the intermediate nodes in LD2 only generate about 15 and 10 beacons per 200 ms, respectively.

VI. CONCLUSION

Long-distance proactive information retrieval in traditional sink-based approaches to diagnosing WSNs often incurs a large amount of transmission overhead. What is more, sink-based approaches cannot afford real-time diagnosis. Conversely, sensor nodes have the first-hand evidence with which to conduct the diagnosis process, but due to their narrow scope of system state information, diagnosis results from single nodes are generally inaccurate. To balance this tradeoff, this work presents LD2, which conducts the diagnosis process in a local area. LD2 distributes the diagnosis workload over the sensor nodes within a diagnosis area. By constructing a fusion tree, each node summarizes the evidence of its child nodes, so that the contribution of each node converges at the root node. Thus, a local consensus on the final diagnosis report is generated and reported to the sink. We implement LD2 on TinyOS 2.1 and evaluate its performance on a real indoor testbed consisting of 50 nodes.

ACKNOWLEDGMENT

This research is supported in part by the NSFC Distinguished Young Scholars Program under Grant No. 61125202, NSFC under Grant No. 61103187, and the China Postdoctoral Science Foundation under Grant No. 2011M500330.

REFERENCES

[1] S. S. Ahuja, S. Ramasubramanian, and M. M. Krunz. Single-link failure detection in all-optical networks using monitoring cycles and paths. IEEE/ACM Transactions on Networking, 17(4):1080-1093, 2009.

[2] Q. Cao, T. Abdelzaher, J. Stankovic, K. Whitehouse, and L. Luo. Declarative tracepoints: a programmable and application independent debugging system for wireless sensor networks. In Proceedings of ACM SenSys, Raleigh, NC, USA, 2008.

[3] B. Chen, G. Peterson, G. Mainland, and M. Welsh. LiveNet: Using passive monitoring to reconstruct sensor network dynamics. In Proceedings of IEEE DCOSS, Santorini Island, Greece, 2008.

[4] L. Girod, J. Elson, A. Cerpa, T. Stathopoulos, N. Ramanathan, and D. Estrin. EmStar: a software environment for developing and deploying wireless sensor networks. In Proceedings of the USENIX Annual Technical Conference, Boston, MA, 2004.

[5] S. Guo, Z. Zhong, and T. He. FIND: faulty node detection for wireless sensor networks. In Proceedings of ACM SenSys, Berkeley, California, 2009.

[6] R. Haenni. Are alternatives to Dempster's rule of combination alternatives? Comments on "About the belief combination and the conflict management problem". Int. J. Information Fusion, 3(3):237-241, 2002.

[7] D. Koks and S. Challa. An introduction to Bayesian and Dempster-Shafer data fusion. Technical Report DSTO-TR-1436, Australian Department of Defence, 2003.

[8] J. Kong, J. Cui, D. Wu, and M. Gerla. Building underwater ad-hoc networks and sensor networks for large scale real-time aquatic applications. In Proceedings of IEEE MILCOM, Atlantic City, New Jersey, 2005.

[9] E. Lefevre, O. Colot, P. Vannoorenberghe, and D. De Brucq. A generic framework for resolving the conflict in the combination of belief structures. In Proceedings of the International Conference on Information Fusion, Paris, France, 2000.

[10] Z. Li, M. Li, J. Wang, and Z. Cao. Ubiquitous data collection for mobile users in wireless sensor networks. In Proceedings of IEEE INFOCOM, Shanghai, China, 2011.

[11] K. Liu, Q. Ma, X. Zhao, and Y. Liu. Self-diagnosis for large scale wireless sensor networks. In Proceedings of IEEE INFOCOM, Shanghai, China, 2011.

[12] S. Liu, G. Xing, H. Zhang, J. Wang, J. Huang, M. Sha, and L. Huang. Passive interference measurement in wireless sensor networks. In Proceedings of IEEE ICNP, Kyoto, Japan, 2010.

[13] Y. Liu, K. Liu, and M. Li. Passive diagnosis for wireless sensor networks. IEEE/ACM Transactions on Networking, 18(4):1132-1144, 2010.

[14] E. Magistretti, O. Gurewitz, and E. Knightly. Inferring and mitigating a link's hindering transmissions in managed 802.11 wireless networks. In Proceedings of ACM MobiCom, Chicago, Illinois, USA, 2010.

[15] X. Miao, K. Liu, Y. He, Y. Liu, and D. Papadias. Agnostic diagnosis: Discovering silent failures in wireless sensor networks. In Proceedings of IEEE INFOCOM, Shanghai, China, 2011.

[16] C. K. Murphy. Combining belief functions when evidence conflicts. Decision Support Systems, 29(1):1-9, 2000.

[17] N. Ramanathan, K. Chang, R. Kapur, L. Girod, E. Kohler, and D. Estrin. Sympathy for the sensor network debugger. In Proceedings of ACM SenSys, San Diego, USA, 2005.

[18] P. Smets. The combination of evidence in the transferable belief model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5):447-458, 1990.

[19] R. Tan, G. Xing, Z. Yuan, X. Liu, and J. Yao. System-level calibration for fusion-based wireless sensor networks. In Proceedings of IEEE RTSS, San Diego, CA, USA, 2010.

[20] X. Wang, L. Fu, and C. Hu. Multicast performance with hierarchical cooperation. IEEE/ACM Transactions on Networking, 20(3), 2011.

[21] A. Woo, T. Tong, and D. Culler. Taming the underlying challenges of reliable multihop routing in sensor networks. In Proceedings of ACM SenSys, Los Angeles, CA, 2003.

[22] K. Xing, F. Liu, X. Cheng, and D. H. C. Du. Real-time detection of clone attacks in wireless sensor networks. In Proceedings of IEEE ICDCS, Beijing, China, 2008.

[23] N. Xu, S. Rangwala, K. K. Chintalapudi, D. Ganesan, A. Broad, R. Govindan, and D. Estrin. A wireless sensor network for structural monitoring. In Proceedings of ACM SenSys, Baltimore, Maryland, 2004.

[24] R. R. Yager. On the Dempster-Shafer framework and new combination rules. Information Sciences, 41(2):93-137, 1987.

[25] J. Yang, M. L. Soffa, L. Selavo, and K. Whitehouse. Clairvoyant: a comprehensive source-level debugger for wireless sensor networks. In Proceedings of ACM SenSys, Sydney, Australia, 2007.