
Deployments Made Easy: Essentials of Managing a (Rural) Wireless Mesh Network

Vijay Gabale∗

IBM Research, [email protected]

Rupesh Mehta†

TCS, [email protected]

Jeet Patani‡

Microsoft, [email protected]

Ramakrishnan K§

TCS, [email protected]

Bhaskaran Raman

IIT Bombay, India, [email protected]

ABSTRACT

In this work, we present our experiences of managing the deployment of a wireless mesh network to support real-time voice services in a village near Mumbai, India. We focus on three essential aspects of our deployment: (1) in-network mechanisms for ease of network planning, (2) network management and data collection in an operational network, and (3) fault-tolerance mechanisms for long-term network sustenance. Especially for rural deployments, where resources in the field are limited and frequent physical visits are costly, considering these three aspects drastically simplified our deployment and measurement activities. Our in-network mechanisms constantly provided the network feedback needed to meet operational challenges in the field. We carefully designed and implemented a low-overhead, non-intrusive network management module over our TDMA-based wireless mesh network. During our deployment, this module successfully diagnosed several network faults in a live network and collected the required network statistics without affecting the primary application. We also implemented a set of fault-tolerance mechanisms in our prototype, and during our deployment, our network proved robust to various network failures. The villagers used our network for more than a month and made more than a hundred voice calls, comprising local calls within the village and remote calls to phones in the outside world.

∗This work was done when the author was a research scholar at IIT Bombay, India.
†This work was done when the author was a research assistant at IIT Bombay, India.
‡This work was done when the author was a masters student at IIT Bombay, India.
§This work was done when the author was a research assistant at IIT Bombay, India.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DEV’13 January 11-12, 2013 Bangalore India
Copyright 2013 ACM 978-1-4503-1856-3/13/01 ...$15.00.

Categories and Subject Descriptors

C.2.1 [Computer-Communication Networks]: Network Architecture and Design—Centralized networks, Wireless communication

General Terms

Design, Experimentation

Keywords

802.15.4, TDMA-based multi-hop MAC, Voice applications

1. INTRODUCTION AND MOTIVATION

In recent years, wireless mesh networking has emerged as a promising alternative to traditional cellular networks for providing communication services in rural areas [1]. Due to the availability of cost-effective hardware, researchers and practitioners have especially focused on developing 802.11-based [2] mesh networks [3, 4] to provide broadband connectivity to villages in developing regions. Although there have been several deployments of such networks in operational settings [5, 6], there are very few attempts (e.g., [7]) to document and resolve the operational challenges faced during deployments of such networks. In this paper, we describe a systematic approach to, and our experiences of, deploying and managing a wireless mesh network to provide voice telephony services in a village near Mumbai, India.

In contrast to traditional mesh networks based on 802.11 technology, our prior work took a drastically different approach and designed a wireless mesh network, Lo3 [8], using 802.15.4 [9] (commonly known as Zigbee) technology. 802.15.4 platforms have low cost (USD 40) and low power requirements (100 mW), and we believe that 802.15.4 is more suitable for deployments in rural areas. Thus we use an 802.15.4-based wireless mesh network to enable real-time voice services in rural areas, and seek a significant trade-off in terms of system cost and power consumption in comparison to traditional cellular-based or WiFi-based systems, which can support broadband connectivity and real-time video applications.

As shown in Fig. 1, in Lo3 (Low cost, Low Power, Local Voice), we enable real-time voice services to provide local voice within the village and remote voice to connect the village to the outside world. For voice calls outside the village, we utilize cellular points of coverage, which are typically found within 2-3 km of the village.

Figure 1: Lo3 system architecture

To realize Lo3 in practice, in our prior work, we developed a prototype of an 802.15.4 handset [10] (henceforth Lo3 handset) capable of voice processing and streaming. We also designed a gateway node [11], which has a cellular interface and an 802.15.4 interface, to extend cellular connectivity inside villages. The downside of 802.15.4 is its low radio capacity of just 250 Kbps. To enable voice applications on such an impoverished radio, we designed and fully implemented a light-weight TDMA (Time Division Multiple Access) based MAC protocol, LiT [12]. We implemented LiT on the Lo3 handset, Lo3 gateway, and TelosB [13] platforms. In our network, the TelosB nodes act as the infrastructure nodes that support the voice calls. We test-deployed a Lo3 network comprising 3 handsets, 6 infrastructure nodes, and 1 gateway node in Ahupe village, near Mumbai, India [14]. The villagers used our system to establish several voice calls within the village and voice calls to phones in the outside world.

1.1 Challenges in network deployment

Our deployment experience showed that managing the deployment of a wireless mesh network can become cumbersome and frustrating if one is not equipped with appropriate tools to meet the operational challenges. The first challenge, at the time of deployment, lies in selecting a node position and fixing an antenna height, a task that can become severely difficult and time-consuming if immediate feedback regarding link quality and node stability is not available. Secondly, measurement of network performance can be a significant overhead if there is no in-network mechanism to automate data collection at a central point. Frequent data collection from individual nodes may require considerable running around and can be substantially time-consuming. Thirdly, network nodes or links can fail due to unpredictable events, and it is critical to provide fault-tolerance mechanisms against such failures for a sustainable deployment. Such mechanisms are especially important for rural deployments, since a downtime of several hours not only results in loss of revenue but may also result in loss of faith. Further, frequent physical visits for fault diagnosis may not be affordable; each visit to Ahupe village costs us USD 150, not accounting for personnel time.

1.2 Our contribution

To tackle the above-mentioned challenges, in this work, we focus on three essential aspects of deployment: (1) in-network mechanisms for ease of network planning, (2) network management and data collection in an operational network, and (3) fault-tolerance mechanisms for long-term network sustenance. One of the main challenges in incorporating such mechanisms in Lo3 is the difficulty of managing the scarce radio bandwidth. Furthermore, we do not want to disturb our primary application, i.e., the real-time voice services, in any way while managing our network.

Although there is a bulk of prior work on managing wireless networks, research papers describing the deployment and management of a (rural) wireless mesh network (e.g., [7]) are few and far between. In this respect, our work makes the following contributions.

• For ease of network planning and data collection, we present a systematic and comprehensive design of a management module over a TDMA-based wireless mesh network. We characterize our management module using a Markov chain based analysis and a simulation study, and show that it operates with minimal bandwidth overhead without affecting our primary applications.

• We design and implement a simple yet effective fault-tolerance mechanism for central node failure for our MAC protocol, which has centralized access control. Further, we design several in-network mechanisms to tolerate unpredictable node or link failures.

• We implemented our management module and fault-tolerance mechanisms in the Lo3 prototype and deployed our network in a village. The villagers used our system for more than a month, and our network served more than a hundred voice calls during this period.

The in-network mechanisms in our management module constantly provided us the desired network feedback while planning our network setup. This helped us to carry out a quick and reliable network deployment. Our module successfully diagnosed several network faults in a live network. For example, we could attribute a temporary node instability to poor signal strength and a dip in voice quality to a burst of packet losses. The management module also collected the desired network statistics without affecting the real-time voice services. Further, our management module helped us to monitor the state of the network in the village and collect call statistics while sitting remotely.

Over a period of one month, our network survived several instances of node and link failures, and the villagers used our network to establish several important voice calls. In particular, due to the choice of 802.15.4 technology, our network operated reliably on AA batteries (totally "off-the-grid") and avoided the network failures due to poor power quality that are frequently suffered by 802.11-based mesh networks [7]. We believe that our design of network management and fault-tolerance mechanisms is applicable to other TDMA-based wireless mesh networks [15, 4] as well. Further, we hope that our deployment experiences are useful to researchers and practitioners who wish to deploy and manage wireless networks in the future.

The rest of the paper is organized as follows. In the next section (Sec. 2), we briefly describe prior work on managing wireless networks. In Sec. 3, we explain the design of our management module on a TDMA-based MAC protocol and show that our module has minimal bandwidth overhead. In Sec. 4, we describe our fault-tolerance mechanisms. In Sec. 5, we present the usefulness of the network management and fault-tolerance mechanisms during our deployment. In Sec. 6, we share our deployment experiences and finally, in Sec. 7, we conclude this paper.

2. PRIOR WORK

We classify the prior work into three categories: (1) network management in single-hop wireless networks, (2) management of wireless mesh networks, and (3) management and fault diagnosis in rural mesh networks. In this section, we describe a set of representative examples in these categories and, in doing so, point out their limitations in meeting the challenges outlined in the previous section.

Network management in single-hop wireless networks: A bulk of the literature on single-hop WiFi networks [16, 17] handles network faults such as RF holes, interference due to hidden terminals, or rogue access points. However, most of these techniques are either offline or assume high availability of bandwidth and system resources. Moreover, the fundamental assumption of a single-hop collision domain does not hold for multi-hop wireless networks.

Network management in multi-hop wireless networks: There is a bulk of prior work on wireless mesh networks, especially in the domain of wireless sensor networks [18, 19, 20]. For example, Sympathy [18] uses a message-flooding approach to pool event data and current states or metrics from wireless sensor nodes. However, it does not ensure that the management application does not intrude on the working of the primary application. Moreover, since there is no coordination for transporting reports to a central sink node, most of the techniques in prior work are prone to the "response implosion" anomaly, where all nodes can attempt to send their reports at the same time, wasting network resources.

Management of rural mesh networks: The authors in [7] document the experiences of building a network management and fault-diagnosis system to manage long-distance WiFi mesh networks deployed in rural areas. To avoid physical visits, the authors employ software and hardware watchdogs for software and hardware failures and build an independent management module at each node, which tracks and predicts the health of a node or a link and sends an SMS reply over a cellular network. Such information can be used to diagnose several network faults (e.g., antenna misalignment, non-802.11 interference, node in wedged state) and decide whether a visit is required or not. However, the work in [7] is focused on network-independent control mechanisms for fault diagnosis in the presence of a cellular back-channel.

In comparison, in our work, we describe a set of mechanisms for ease of network planning and network management without the need for any communication back-channel. A large number of villages in the developing world do not have cellular coverage; thus, our insights are especially important for deploying and managing a rural network without any back-channel. Our in-node management mechanisms considerably simplify deployment efforts and take reactive actions based on changes in network state (e.g., removing lossy links from the routing graph). Further, each node runs a part of the network management module, which logs essential information such as the number of packet losses, time-sync offset, etc., and helps us in network diagnosis and post-failure analysis. Importantly, we analytically verify that our module utilizes minimal resources and does not affect the primary application.

3. NETWORK MANAGEMENT MODULE

We now describe the design of the network management module that we have implemented on top of our MAC protocol, LiT [12]. LiT is a centralized TDMA-based MAC protocol. It divides time into frames comprising three types of slots: control slots used for TDMA schedule dissemination, contention slots used for infrequent traffic such as node join and voice call requests, and data slots for carrying voice traffic. Our network management module uses the contention slots of LiT to send management packets towards the root node. The root node then broadcasts these packets in a dedicated control slot to a special node, which interacts with a network administrator and provides management information.

Further, we analytically show that our management module has minimal overhead on the working of LiT, and that it does not affect our primary application, i.e., real-time voice streaming, in any way. This is especially important given the resource-constrained nature (computation and communication) of 802.15.4 technology.

3.1 Design of management module

Figure 2: Lo3 network management architecture

Components of network management module

Our module is divided into four components: (1) logger, (2) log-collector, (3) log-listener, and (4) remote updater.

Logger: As shown in Fig. 2, a logger component runs on each node in the network. The logger component locally logs several important events using a data structure and a set of function calls. The data structure stores each event as a log consisting of an event type (e.g., time-sync), the corresponding information (e.g., TDMA offset), and the event time (e.g., the time at which the offset was logged). Other such events include the time taken to join the network, the latency of establishing a voice call, etc. The logger also provides interfaces to send the logs to the log-collector (explained below) when it is queried for a log transfer.
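The logger's record layout can be illustrated with a minimal Python sketch. The three fields follow the description above; the class names, the fixed capacity, and the drop-oldest policy are assumptions made for illustration and are not taken from the Lo3 implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogEntry:
    event_type: str   # e.g. "time-sync", "node-join", "call-setup"
    info: int         # e.g. TDMA offset, join time, call-setup latency
    event_time: int   # local time at which the event was logged


class Logger:
    """Per-node event log (hypothetical sketch, not the Lo3 code)."""

    def __init__(self, capacity: int = 32) -> None:
        # The capacity is an assumed parameter standing in for the
        # limited memory of an embedded node.
        self.capacity = capacity
        self._entries: List[LogEntry] = []

    def log(self, event_type: str, info: int, event_time: int) -> None:
        # Assumed policy: when the buffer is full, drop the oldest entry.
        if len(self._entries) == self.capacity:
            self._entries.pop(0)
        self._entries.append(LogEntry(event_type, info, event_time))

    def drain(self) -> List[LogEntry]:
        """Return all stored logs and clear the buffer (on a log query)."""
        entries, self._entries = self._entries, []
        return entries
```

A node would call `log(...)` whenever an event of interest occurs, and `drain()` when the log-collector queries it.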

Log-collector: The log-collector component runs on a special node in the network which sits near the root node. The log-collector module collects the logs from the network nodes and streams them to a program running on a laptop. Using contention slots, the log-collector can issue a query to the root node to collect logs from a particular node. The log-collector program also listens to live log broadcasts from the root node and displays the network state accordingly.

Log-listener: We connect the log-collector (a TelosB node) to a laptop over a USB connection to work with a log-listener module. The log-listener module is an interactive program implemented in Java, which the network administrator uses to issue queries and seek answers through the log-collector. The log-listener module maintains a history of the network run, so the network administrator can also issue queries that are answered locally by analyzing the collected data.

Remote updater: We use the back-haul connectivity of our gateway module and implement an SMS remote updater. Since the TDMA schedule is broadcast to all network nodes, the gateway node also receives the schedule, which consists of (1) the number of nodes in the network, in terms of the control schedule, and (2) information about ongoing calls, in terms of the data schedule. The remote updater extracts the necessary information from the schedule and sends periodic updates as SMSs to a phone in our lab. This way, we can remotely monitor the activities in our network.
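As a rough illustration of the extraction step, the sketch below summarizes a received schedule into a short status string suitable for an SMS. The schedule representation, the field names, and the message format are all hypothetical; the text only states that the control schedule reveals the number of nodes and the data schedule reveals the ongoing calls.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TdmaSchedule:
    # Hypothetical representation of a received LiT schedule.
    control_slots: List[int]            # node id owning each control slot
    data_flows: List[Tuple[int, int]]   # (caller id, callee id) per call


def summarize(schedule: TdmaSchedule) -> str:
    """Build a one-line status text from the schedule (assumed format)."""
    # A node may own more than one control slot (the root grants itself
    # an extra one), so count distinct node ids.
    nodes = len(set(schedule.control_slots))
    calls = len(schedule.data_flows)
    return f"Lo3 status: nodes={nodes} calls={calls}"
```

Sending the text over the cellular interface is outside the scope of this sketch.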

Working of network management module

Figure 3: Integration of management module in LiT

Live broadcasts by the root node: Since we have central control at the root node, the state at the root node conveys the entire network state. Having such central control is an important factor for network management since, most of the time, it suffices to fetch the network state from the central node. Further, to provide live updates in the network, the root node grants itself an extra control slot in the control schedule, and explicitly broadcasts the changes in its state (see Fig. 3). Note that, since control slots are collision-free, the root has exclusive control over such broadcasts.

Such live broadcasts are received by the log-collector (which sits within range of the root node); the collector processes and displays the log updates using the log-listener component. The root node logs important events such as reception of a join request, reception of a flow request, reception of topology updates, changes in the network connectivity graph, changes in the status of network links, etc.

Log collection from non-root nodes: When a network administrator interacts with the log-listener program, the log-collector issues a query to a specific node through the root node. Listening to the control packets from the root node, the log-collector synchronizes with the network and uses the contention slots in the TDMA frame to send a query to the root node with the identity of the node to query. When the root node receives the query, it sets a special flag in its control schedule which signifies the identity of the node to query. Then, during control schedule dissemination, when the target node receives the control schedule, it prepares itself to send a response to the query.

To send the logs stored by the logger to the log-collector, the nodes use the contention slots to first send the log packets to the root node. The contention slots use the parent-child relation defined by the control schedule to send the packets towards the root node. The root node then forwards the logs to the log-collector. Since per-hop acknowledgements are enabled in contention slots and since we do not observe any queue overflows, the log-collector reliably receives all the logs from the logger module.
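The hop-by-hop route that a log packet takes can be sketched as a walk up the parent-child relation. In the snippet below, the dictionary encoding of the tree and the function name are illustrative assumptions; the contention access and per-hop acknowledgements are not modelled.

```python
from typing import Dict, List, Optional


def path_to_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the sequence of nodes a packet visits from `node` to the root.

    `parent` maps each node to its parent; the root maps to None.
    """
    path = [node]
    seen = {node}
    while parent[path[-1]] is not None:
        nxt = parent[path[-1]]
        if nxt in seen:
            # A well-formed control schedule defines a tree, so a cycle
            # indicates a corrupted parent map.
            raise ValueError("parent relation contains a cycle")
        seen.add(nxt)
        path.append(nxt)
    return path
```

For a three-node chain in which R is the parent of A and A is the parent of B, a log packet from B is forwarded over the path B, A, R.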

Displaying logs: The log-listener program running on the laptop listens to the data sent by the log-collector through the USB port. When the log-collector module receives all logs from a node or receives broadcasts from the root node, it streams the logs over the USB connection. The log-listener program then displays the logs on the terminal. The listener program also creates an HTML report of the logs and maintains a history of the network status over time.

3.2 Low overhead of management module

We now show that our management module has minimal overhead on the working of LiT and that it does not affect the primary real-time application while operating simultaneously. As noted earlier, our management module uses the contention slots in the TDMA frames. Thus, a natural question is: do the management packets overload the contention slots? To enable high utilization of contention slots, we have carefully designed a multi-hop contention access mechanism for channel access, and it is our choice of an efficient channel access mechanism that keeps the overhead of management packets minimal. An efficient contention slot mechanism not only results in fewer collisions in contention slots, but also leads to fewer transmissions in delivering a packet from every node to the root node.

For Lo3, the channel access mechanism in contention slots is inherently probabilistic¹. We evaluated two different algorithms for contention resolution and efficient channel access: A1 with $p = \frac{1}{m_n}$ and A2 with $p = \frac{h-i}{\sum_{j=1}^{h-1} j}$, where $p$ is the access probability, $n$ is the number of collisions, $m_n$ is the maximum number of nodes in a collision domain, $i$ is the level of the node, and $h$ is the total number of levels. In our centralized system, these parameters can be calculated and dynamically conveyed by the root to each node of the network. For ease of transmitting packets in contention slots towards the root node, we arrange the network in a tree topology, and the level of a node is its hop count from the root node. However, note that the transmission of data packets happens over the graph formed by the underlying network links.

It is well known that in a single-hop collision domain, if each node transmits with probability $p = \frac{1}{m_n}$, the total number of slots used to send at least one packet from each node is minimized. However, intuitively speaking, given the centralized nature of LiT, the nodes closer to the root should receive more transmission opportunities. With this observation, we design algorithm A2, where a node at level $i$ transmits with probability $p = \frac{h-i}{\sum_{j=1}^{h-1} j}$. Suppose a tree has $h = 4$ levels, numbered in Breadth First Search (BFS) order: 0 for the root node, 1 for the first level, and so on, increasing by 1 up to level $h - 1 = 3$ for the leaves. Then the level 1 nodes access the contention slot with probability $\frac{3}{6}$, level 2 nodes with probability $\frac{2}{6}$, and level 3 nodes with probability $\frac{1}{6}$. We note that the choice of A2 is far from ideal, but we believe this exercise is still valuable, since our goal is to make a case for a contention access mechanism for a tree topology, which is the norm in most deployed networks. To this end, we ask: how does A2 perform in comparison to A1?

¹For the inapplicability of carrier sense and back-off algorithms, refer to our prior work [12].
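The two access-probability rules can be written down directly. The following is a minimal sketch using exact fractions; the function names are illustrative, and the formulas are those given above for A1 and A2.

```python
from fractions import Fraction
from typing import Dict


def a1_probability(max_nodes: int) -> Fraction:
    """A1: every node transmits with p = 1 / m_n."""
    return Fraction(1, max_nodes)


def a2_probabilities(h: int) -> Dict[int, Fraction]:
    """A2: a node at level i (1 <= i <= h-1) transmits with
    p = (h - i) / (1 + 2 + ... + (h - 1)). Level 0 is the root."""
    if h < 2:
        raise ValueError("need at least the root and one non-root level")
    total = sum(range(1, h))  # 1 + 2 + ... + (h - 1)
    return {i: Fraction(h - i, total) for i in range(1, h)}
```

For `h = 4` this reproduces the worked example: levels 1, 2, and 3 receive 3/6, 2/6, and 1/6, which `Fraction` stores in lowest terms as 1/2, 1/3, and 1/6.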

Modelling contention slot access algorithms

To select the more efficient mechanism among A1 and A2 (the one which uses the fewest contention slots), we abstract an access mechanism as a Markov model. A Markov model consists of a set of states and a set of transitions among the states. Each state in our Markov model represents the number of impending packets at each non-root node of the network and the number of packets received by the root node.

To visualize a Markov model, consider the simple topology shown in Fig. 4(a), with A as B's parent and R as A's parent. Fig. 4(b) shows the absorbing Markov model for this topology, where each state comprises three values: the number of impending packets at B, the number of impending packets at A, and the number of packets received by the root. The figure also shows the transitions between the states. For instance, the system moves from the state (1, 1, 0) to the state (0, 2, 0) when B successfully transmits a packet to A, which happens with probability $P_b = P_1(1 - P_2)$, where $P_1$ is the transmission probability of node B and $P_2$ is the transmission probability of node A. The probability of arriving in a state depends only on the previous state; thus the model satisfies the Markov property. The state (0, 0, 2) is the absorbing state, since it has no outgoing transition and so cannot be left. Since the Markov chain has at least one absorbing state, and since it is possible to reach an absorbing state from every state, the Markov chain shown in Fig. 4(b) is a valid absorbing Markov chain. The corresponding transition probability matrix P is shown in Fig. 5(a).
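To make the example concrete, the sketch below builds the transition matrix of this two-node chain for given $P_1$ and $P_2$. The state ordering is inferred from Fig. 4(b) and Fig. 5(a) and should be read as an assumption; the function and variable names are illustrative.

```python
from fractions import Fraction
from typing import List

# States as (packets at B, packets at A, packets received by R).
# Assumed ordering: TRS1, TRS2, TRS3, TRS4, ABS0.
STATES = [(1, 1, 0), (1, 0, 1), (0, 2, 0), (0, 1, 1), (0, 0, 2)]


def transition_matrix(p1: Fraction, p2: Fraction) -> List[List[Fraction]]:
    """Transition matrix for the chain B -> A -> R.

    p1 is B's transmission probability and p2 is A's. A transmission
    succeeds only if the other backlogged node stays silent.
    """
    pb = p1 * (1 - p2)  # only B transmits: B's packet moves to A
    pa = p2 * (1 - p1)  # only A transmits: A's packet reaches R
    z = Fraction(0)
    return [
        [1 - (pa + pb), pa, pb, z, z],   # (1,1,0): both nodes backlogged
        [z, 1 - p1, z, p1, z],           # (1,0,1): only B backlogged
        [z, z, 1 - p2, p2, z],           # (0,2,0): only A backlogged
        [z, z, z, 1 - p2, p2],           # (0,1,1): A holds the last packet
        [z, z, z, z, Fraction(1)],       # (0,0,2): absorbing state
    ]
```

Each row sums to 1, which is a quick sanity check when transition matrices are written out by hand.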

To solve the Markov chain and find the number of steps required to reach the absorbing state from the initial state (which corresponds to the number of contention slots required), we apply the solution technique in [21]. As part of this technique, the transition probability matrix P is arranged in canonical form, where all absorbing states appear after the transient states in the rows and columns of the matrix. Note that an absorbing Markov model may have multiple absorbing states. However, our model has only one absorbing state, which represents all packets having been received by the root node. Let Q denote the submatrix of transitions among the transient states, and let R denote the submatrix of transitions from the transient states to the absorbing states. The resulting matrix P in terms of Q and R is shown in Fig. 5(b), where I is the identity matrix.

Now, let $P^n$ be the transition matrix after $n$ steps, as shown in Fig. 5(b). In $P^n$, as $n \to \infty$, $Q^n \to 0$ and $(I + Q + Q^2 + \ldots) \to N$, where $N$ is called the fundamental matrix of the absorbing Markov model. An entry $n_{ij}$ of the fundamental matrix $N$ gives the expected number of times that the process is in the transient state $T_j$ if it is started in the transient state $T_i$. For instance, $n_{12}$ represents the expected number of times (or steps) the network is in the transient state $T_2$ if it is started in the transient state $T_1$. Our goal is to find the expected number of steps before the chain is absorbed in state $A_0$, given that the chain starts in state $T_1$. This is given by $m_1 = \sum_j n_{1j}$ in the timing matrix $M$, whose entries are of the form $m_i = \sum_j n_{ij}$. Now, as $n \to \infty$, $N = (I + Q + Q^2 + \ldots) \to (I - Q)^{-1}$. Thus, if we calculate $N = (I - Q)^{-1}$, we can calculate $M$, which gives us $m_1$. We use Scilab to calculate $(I - Q)^{-1}$, and hence $m_1$.
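The same computation can be sketched in Python instead of Scilab. Since $M = Nc = (I - Q)^{-1}c$ with $c$ a column of ones, it suffices to solve the linear system $(I - Q)\,m = c$. The snippet below does this with exact fractions for the two-node chain of Fig. 4; the function names and the assumed state ordering (TRS1 to TRS4) are illustrative.

```python
from fractions import Fraction
from typing import List

Matrix = List[List[Fraction]]


def two_node_q(p1: Fraction, p2: Fraction) -> Matrix:
    """Transient block Q of Fig. 5(a) for the chain B -> A -> R."""
    pa = p2 * (1 - p1)
    pb = p1 * (1 - p2)
    z = Fraction(0)
    return [
        [1 - (pa + pb), pa, pb, z],
        [z, 1 - p1, z, p1],
        [z, z, 1 - p2, p2],
        [z, z, z, 1 - p2],
    ]


def expected_steps(q: Matrix) -> List[Fraction]:
    """Solve (I - Q) m = 1 by Gauss-Jordan elimination.

    m[i] is the expected number of steps (contention slots) before
    absorption when the chain starts in transient state i.
    """
    n = len(q)
    # Augmented matrix [I - Q | 1].
    a = [
        [(Fraction(1) if i == j else Fraction(0)) - Fraction(q[i][j])
         for j in range(n)] + [Fraction(1)]
        for i in range(n)
    ]
    for col in range(n):
        # I - Q is invertible for an absorbing chain, so a pivot exists.
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        inv = a[col][col]
        a[col] = [x / inv for x in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [row[n] for row in a]
```

For example, with $P_1 = P_2 = \frac{1}{2}$, `expected_steps(two_node_q(Fraction(1, 2), Fraction(1, 2)))[0]` evaluates to 6, i.e. six contention slots on average to deliver both packets to the root in this small chain.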

Using our Markov model, we now analyze algorithms A1 ($p = \frac{1}{m_n}$) and A2 ($p = \frac{h-i}{\sum_{j=1}^{h-1} j}$). For the analysis, we considered three topologies, shown in Fig. 6. We calculated the transition matrix manually and used Scilab [22] to compute the timing matrix. This computation gives the total expected number of contention slots required for each node to deliver a packet successfully to the root node. We tabulate these results in Tab. 1. The table shows that our design choice, algorithm A2, requires considerably fewer contention slots than algorithm A1, and thus manages the contention slots efficiently. We note that our model can be used to analyze any channel access algorithm that assigns a fixed probability to network nodes.

Figure 6: Sample topologies (a), (b), and (c) to analytically evaluate contention slot access algorithms

Algorithm   Topo (a)   Topo (b)   Topo (c)
A2          5          13         25
A1          9          21         39

Table 1: Analytical comparison

Although the above Markov analysis is useful to theoretically verify the efficiency of a probability-based access mechanism, it becomes cumbersome in practice to analyze large topologies (since the size of the transition matrix is quadratic in the number of nodes). Thus, to test the scalability of the access mechanisms, we experiment with larger topologies and compare the two algorithms empirically. We set up a network of 25 infrastructure nodes and 100 handset nodes with a 2% link error rate in our custom-built simulator [12]. Every handset node in our simulator generates voice calls with a mean call duration of 2 minutes and a mean inter-call duration of 1 hour, both drawn from exponential distributions. Since we are limited by memory on our TelosB platform, we set the contention slot queue size to 8 in our simulations. We then record the total number of packets sent or forwarded (including retransmissions, if any) and the number of queue drops. We also record the number of slots used for probabilistic back-off by each node.
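The per-handset call workload described above can be sketched as follows. This is not the custom simulator of [12]; the function name, the seed handling, and the interpretation of "inter-call duration" as the idle gap between the end of one call and the start of the next are assumptions.

```python
import random
from typing import List, Tuple


def generate_calls(
    duration_s: float,
    mean_call_s: float = 120.0,    # mean call duration: 2 minutes
    mean_gap_s: float = 3600.0,    # mean inter-call duration: 1 hour
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Return (start time, call length) pairs for one handset.

    Call lengths and idle gaps are drawn from exponential distributions.
    """
    rng = random.Random(seed)  # seeded for reproducible runs
    calls: List[Tuple[float, float]] = []
    t = rng.expovariate(1.0 / mean_gap_s)  # idle time before first call
    while t < duration_s:
        length = rng.expovariate(1.0 / mean_call_s)
        calls.append((t, length))
        # Next call starts after this call ends plus an idle gap.
        t += length + rng.expovariate(1.0 / mean_gap_s)
    return calls
```

A simulation of 100 handsets would call this once per handset with distinct seeds and merge the resulting call lists.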

We run the simulations multiple times while randomly choosing the node locations, thus occasionally forming irregular topologies, and count the number of contention slots used by each algorithm. Our simulation shows that (1) in no case does A2 require more contention slots than A1, and (2) on average, A2 requires only two-thirds ($\frac{2}{3}$) of the contention slots required by A1. In terms of the effect on tree stability, we did not observe any queue drops or spurious failures for A2. Importantly, A2 does not affect the flow requests (to establish a call), periodic bandwidth requests (to maintain a call), and tear-down requests (to terminate a call) that traverse the contention slots. That is, we do not observe any drop of packets belonging to the primary application. Thus, A2 is indeed effective for use in contention slots in multi-hop networks.

Figure 4: Example of an absorbing Markov chain, with $P_b = P_1(1 - P_2)$ and $P_a = P_2(1 - P_1)$. Panel (a): topology B → A → R with transmission probabilities $P_1$ (node B) and $P_2$ (node A). Panel (b): state-transition diagram.

(a) Transition probability matrix $P$:

$$
P =
\begin{array}{c|ccccc}
 & TRS_1 & TRS_2 & TRS_3 & TRS_4 & ABS_0 \\ \hline
TRS_1 & 1 - (p_1 + p_2 - 2p_1p_2) & p_2 - p_1p_2 & p_1 - p_1p_2 & 0 & 0 \\
TRS_2 & 0 & 1 - p_1 & 0 & p_1 & 0 \\
TRS_3 & 0 & 0 & 1 - p_2 & p_2 & 0 \\
TRS_4 & 0 & 0 & 0 & 1 - p_2 & p_2 \\
ABS_0 & 0 & 0 & 0 & 0 & 1
\end{array}
$$

(b) Canonical form and its $n$-step power, with rows and columns ordered as (TRS, ABS):

$$
P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix}
\;\rightarrow\;
P^n = \begin{pmatrix} Q^n & (I + Q + \ldots + Q^{n-1})R \\ 0 & I \end{pmatrix}
$$

$$
N = (I + Q + Q^2 + \ldots) = (I - Q)^{-1}, \qquad M = Nc \;\text{ where }\; c = [1\ 1\ \ldots\ 1]^T
$$

Figure 5: TRS: Transient state, ABS: Absorbing state, P: Transition Matrix

To summarize, our management module has four important properties. (1) With the use of an efficient channel access mechanism, our module is non-intrusive and does not affect our primary application. (2) It has minimal overhead. (3) Our simulation study confirms that it is scalable. (4) It ensures reliable delivery, since we enable per-hop acknowledgements in contention slots and there are no queue drops at any hop. We now move on to the fault-tolerance mechanisms that we have implemented in our prototype to provide robustness against unpredictable failures in the network.

4. FAULT TOLERANCE

During our in-lab and in-campus test deployments, we observed a few instances of node or link failures after a run of several hours. For instance, the timer component in our software failed or the radio hardware on a TelosB stopped working. Thus, to recover the network from such failures, we have built a set of in-network fault-tolerance mechanisms in our prototype. We categorize these mechanisms into two types: (1) root node failure, and (2) unpredictable software or hardware failures at any node. Such mechanisms are especially important to us since debugging and recovery through a physical visit is costly; each visit to Ahupe village costs us USD 150.

4.1 Tolerance to central node failure

The Lo3 network employs a centrally controlled MAC protocol. Such centralized networks have advantages in terms of ease of network administration and maintenance, but they are generally frowned upon for their lack of fault tolerance. To address this concern, we have implemented a fault-tolerance mechanism in our prototype. In our mechanism, the primary root node chooses one of its immediate infrastructure nodes a priori as the backup. This backup node listens to the periodic broadcast packets sent by the primary root node in the control slots. If it does not receive such packets for a certain period, it concludes that the primary root node is down and declares itself the new root of the network. When the primary root node is repaired, the backup node relinquishes central control back to the primary root node.
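The timeout logic at the backup node can be sketched as a small state machine. This is an illustrative sketch only: the class and method names, the timeout value, and the rule that hearing the primary again immediately triggers the hand-back are assumptions, and the handling of false detections is not modelled here.

```python
from enum import Enum


class Role(Enum):
    BACKUP = "backup"
    ACTING_ROOT = "acting-root"


class BackupRoot:
    """Backup node that takes over when the primary root falls silent."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s   # assumed silence threshold
        self.role = Role.BACKUP
        self.last_heard_s = 0.0

    def on_primary_broadcast(self, now_s: float) -> None:
        """Called when a control-slot broadcast from the primary is heard."""
        self.last_heard_s = now_s
        if self.role is Role.ACTING_ROOT:
            # The primary is back: relinquish central control.
            self.role = Role.BACKUP

    def on_tick(self, now_s: float) -> Role:
        """Called periodically; promotes the node after prolonged silence."""
        silent_for = now_s - self.last_heard_s
        if self.role is Role.BACKUP and silent_for > self.timeout_s:
            self.role = Role.ACTING_ROOT
        return self.role
```

A longer timeout reduces false take-overs caused by bursts of packet loss, at the cost of a longer fail-over latency.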

A common obstacle to reliability and consistency in such recovery mechanisms is wireless packet loss. For example, due to the loss of broadcast packets from the root node, it is possible that the backup node falsely concludes that the root node is down. How should we tackle such an inconsistent state? In this section, we describe our fault-tolerance mechanism, which ensures that the network converges to a consistent state if such inconsistency occurs in the network. We have implemented this fault-tolerance mechanism in our prototype, where it achieves a fail-over latency of just 25 seconds. Our mechanism is simple yet effective, in contrast to complex distributed root election algorithms, which may take longer to converge, especially in the presence of wireless packet losses.

Figure 7: Two trees with a common node (tree T rooted at R, tree T′ rooted at R′, and a common node C)

Challenges in detecting root failure. It is quite challenging to detect the failure of the root node. Firstly, a series of wireless packet losses due to interference or external effects can trigger a false detection of root failure. Secondly, in comparison to one-hop centralized architectures, in multi-hop centralized networks a central node failure can lead to a cascading effect where all nodes, including the nodes more than one hop away from the root, may start the recovery process. This can decrease the overall availability of the network. Thirdly, due to false detection of failure, more than one root node may be active in the network. This can lead to an inconsistency in the network state, with two different TDMA schedules. The authors in [23] abstract these challenges as (1) partition tolerance (due to the wireless packet losses), (2) availability (due to the downtime for the recovery), and (3) consistency (due to the activities of more than one root node). Further, this work points out that it is impossible to meet all three challenges at the same time.

However, as mentioned in [23], it is possible to trade away one property while ensuring the other two. For our network, we wish to maximize network availability since we believe that a “no network” situation is worse than “some network”. Thus, in our mechanism, we trade consistency while guaranteeing availability and partition tolerance.

Fault tolerance algorithm for central node failure. If R′ falsely detects that R is down, the presence of two root nodes can lead to an inconsistent schedule, since R and R′ may allocate the same channel resource to two different nodes or flows. Thus, it is necessary to detect the presence of two root nodes in such cases. We claim that, if there is no bridge between R and R′, then it is possible to detect the presence of more than one root node. An edge in a graph is called a bridge if its removal disconnects the graph. With this definition, we now give a method to detect the presence of two root nodes.
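The bridge definition above can be checked directly on a small network graph: remove each edge in turn and test whether the graph stays connected. The sketch below is illustrative only; the function names and the four-node topology (with `Rp` standing for R′) are hypothetical, and the input graph is assumed to be connected to begin with.

```python
from collections import deque


def is_connected(nodes, edges):
    """Return True if every node is reachable from an arbitrary start node."""
    nodes = list(nodes)
    if not nodes:
        return True
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(nodes)


def find_bridges(nodes, edges):
    """An edge is a bridge if removing it disconnects the graph."""
    edges = list(edges)
    return [e for i, e in enumerate(edges)
            if not is_connected(nodes, edges[:i] + edges[i + 1:])]


# Hypothetical topology: R, Rp and C form a cycle; D hangs off C.
nodes = ["R", "Rp", "C", "D"]
edges = [("R", "Rp"), ("R", "C"), ("Rp", "C"), ("C", "D")]
bridges = find_bridges(nodes, edges)
```

In this example the link between R and Rp is not a bridge, so a node such as C can hear both roots. The brute-force check is quadratic in the number of links, which is acceptable for a network of about ten nodes.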

Let us represent the tree formed by R as T, and the tree formed by R′ as T′, as shown in Fig. 7. If there is no bridge in the network, then there exists a node in the network which has a link to at least one node in tree T′ and at least one node in tree T. Thus this node, say node C, is able to listen to the broadcast packets, disseminated as part of the control schedule, that originate at the two different root nodes.

Each of the control packets has a field which carries the hardware-level identifier of the root node. Thus, C can differentiate between the control packets received on the two different trees (see Fig. 7). Then, C can detect the presence of both R and R′, and inform R′ that R is still up. For this purpose, we use special NOTIFY_ROOT packets in contention slots, which notify R′ of the presence of R. Once R′ receives this information, it terminates its status as the root node and rejoins the network as an infrastructure node. Now, if R′ joins the network at a level which is at least two hops away from R, R can select a new node R′′ as the backup root node. Also, if R goes down while R′ is joining the network, R′ will detect the failure of R, and it will (again) start acting as the root.
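The detection step at node C and the reaction at R′ can be sketched as follows. This is a simplified model under stated assumptions: the class names, the tuple used to represent the NOTIFY ROOT packet (written `NOTIFY_ROOT` in the code), and the string identifiers are hypothetical, and the sketch never ages out old root identifiers, which a real node would need to do.

```python
from dataclasses import dataclass, field


@dataclass
class CommonNode:
    """Node C: records the root identifiers seen in control packets."""
    roots_seen: set = field(default_factory=set)

    def on_control_packet(self, root_id: str):
        # Each control packet carries the hardware identifier of its root.
        self.roots_seen.add(root_id)
        if len(self.roots_seen) > 1:
            # Two roots are active: report the full set in a NOTIFY_ROOT.
            return ("NOTIFY_ROOT", frozenset(self.roots_seen))
        return None


@dataclass
class BackupRootNode:
    """R': steps down when told that the primary root is still alive."""
    my_id: str
    primary_id: str
    is_root: bool = True

    def on_notify_root(self, roots: frozenset) -> None:
        if self.primary_id in roots:
            self.is_root = False  # rejoin as an ordinary infrastructure node


c = CommonNode()
first = c.on_control_packet("R")    # only one root heard so far
second = c.on_control_packet("Rp")  # a second root identifier appears
rp = BackupRootNode(my_id="Rp", primary_id="R")
rp.on_notify_root(second[1])
```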

What happens if the link between R and R′ is a bridge? In this case, if R goes down and R′ is selected as the next root, the network will be divided into two parts (P1 and P2) as shown in the figure. If R is up while R′ becomes root, P1 will be operated by R, while P2 will be operated by R′. In this case, as long as R and R′ maintain the same schedule state, i.e., the same allocation of control and data schedules, the two partitions will continue to operate independently. Otherwise, two different schedule states may result in inconsistency in both components of the network. For such a level of inconsistency, we rely on the log-collector in our network management module, which sits near R. For instance, if the number of deployed nodes does not match the number of active nodes, the log-collector node can raise an alarm to the network administrator. In the worst case, if the logger node too is cut off from R, then the network administrator relies on external events, e.g., complaints from the users, and takes the required set of actions, e.g., rebooting R and then R′.

Overall, our mechanism trades consistency and manages the failure of the central root node while ensuring that at least one root is available at all times and tolerating wireless packet losses. It also ensures that, in case of inconsistency, the network converges to a consistent state in a short period of time. We now briefly explain what happens when a node or a link failure occurs.

4.2 Tolerance to unexpected software & hardware failures

As we noted earlier, while testing our prototype, we observed a few instances of infrastructure and handset node failures due to unexpected behavior of software and hardware components. For instance, on a few occasions, the Speex codec running on the C5515 platform went into a bad state, the timer component of the TinyOS software failed to trigger, or the radio component of TelosB went into an indeterminate state. To provide resilience to such node failures, we have enabled the watchdog timers available with the MSP430 MCU on our infrastructure nodes and with the C5515 DSP on our handset and gateway nodes. In our logger module, we keep clearing the counter of the watchdog timer as long as all the intended components are working; otherwise the timer counts down and resets the processor when it reaches its limit. A processor reset on our platforms results in a total reboot of the node, which recovers the node from any software or hardware failure. After reboot, the node executes the node join procedure, which involves discovering the neighborhood by listening to the control schedule (or beacons) in the control slot, getting in sync with the root node, placing a join request in a contention slot, and discovering the parent notified by the subsequent control schedule.
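The watchdog logic described above can be modelled in software as a countdown that is restarted only while every monitored component is healthy. This is a behavioural sketch, not MSP430 or C5515 register code; the class, the `limit_ticks` value, and the component names are hypothetical.

```python
class Watchdog:
    """Software model of a countdown watchdog timer."""

    def __init__(self, limit_ticks: int):
        self.limit_ticks = limit_ticks
        self.remaining = limit_ticks
        self.resets = 0

    def kick(self) -> None:
        # Clearing the counter restarts the countdown.
        self.remaining = self.limit_ticks

    def tick(self) -> None:
        # One timer tick; on expiry the processor would reset (counted here).
        self.remaining -= 1
        if self.remaining <= 0:
            self.resets += 1
            self.remaining = self.limit_ticks


def logger_step(wdt: Watchdog, component_ok: dict) -> None:
    # Kick only if every monitored component reports healthy.
    if all(component_ok.values()):
        wdt.kick()
    wdt.tick()


wdt = Watchdog(limit_ticks=3)
health = {"timer": True, "radio": True, "codec": True}
for _ in range(10):
    logger_step(wdt, health)
resets_while_healthy = wdt.resets

health["radio"] = False  # radio enters an indeterminate state
for _ in range(3):
    logger_step(wdt, health)
resets_after_fault = wdt.resets
```

Because the kick is tied to the health of all components, a failure in any one of them is enough to let the countdown expire and force a reboot.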

Tolerance to link failure: Our management module also provides tolerance to link failures in the network. For instance, if a link in the network graph suffers from poor signal strength or heavy packet losses, the root node analyses the periodic logger updates and removes such links from the network graph. Further, the root can take appropriate action, such as assigning a new parent to a node if the node faces unpredictable variations on the link to its current parent. Overall, our management module has four important functions: (1) proactive mechanisms for ease of network planning, (2) interfaces to store and display information on application performance, (3) in-node mechanisms for software and hardware failures, and (4) reactive mechanisms to diagnose and recover from losses due to link or node failures. We now document the usefulness of these functions while managing the deployment of Lo3 in the field.
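The link pruning and parent reassignment performed at the root can be sketched as a filter over the reported link statistics. The thresholds, the dictionary layout, and the rule of picking the candidate parent with the lowest error rate are assumptions made for illustration.

```python
def prune_links(link_stats, min_rssi_dbm, max_error_rate):
    """Keep only links whose reported quality meets both thresholds."""
    return {
        link: stats
        for link, stats in link_stats.items()
        if stats["rssi_dbm"] >= min_rssi_dbm
        and stats["error_rate"] <= max_error_rate
    }


def pick_parent(node, candidates, good_links):
    """Choose the candidate parent with the lowest error rate, if any."""
    usable = [p for p in candidates if (node, p) in good_links]
    if not usable:
        return None
    return min(usable, key=lambda p: good_links[(node, p)]["error_rate"])


# Hypothetical logger updates: (child, parent) -> link statistics.
link_stats = {
    ("n3", "n1"): {"rssi_dbm": -92, "error_rate": 0.30},
    ("n3", "n2"): {"rssi_dbm": -75, "error_rate": 0.02},
}
good = prune_links(link_stats, min_rssi_dbm=-85, max_error_rate=0.10)
new_parent = pick_parent("n3", ["n1", "n2"], good)
```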

5. USE OF NETWORK MANAGEMENT & FAULT-TOLERANCE MECHANISMS

Figure 8: Deployment of Lo3 at Ahupe

We test-deployed a full-fledged Lo3 network in Ahupe village, near Mumbai, India, for 18 hours over 3 days. We set up a 10-node network in the village as depicted in Fig. 8. We attached 8 dBi external antennas to our nodes with RF cables, and tied the antennas at a height of about 3 meters to a plastic pipe, using clamps (see Fig. 9). We placed the gateway node at a point where we get cellular connectivity and, using two 802.15.4 links, we could extend the connectivity into the village, about 0.6 km away from the place of cellular coverage. We also placed several nodes within the village to enable local voice telephony (see Fig. 10). While planning our network setup, to ensure reliable deployment, we were interested in knowing the answers to several questions regarding network performance. We now present that set of questions and the responses of our management module to them. See Tab. 2 for a summary.

| Queries during or after deployment | Answers or hints by management module |
|---|---|
| Is the location and height of antenna appropriate to fix a node? | Instant statistics of strength and stability of corresponding network links |
| What has been the network performance: join latency, call-grant latency, node-stability, call routing path, call error rate, node failures | Periodic collection of logs from network nodes, real-time analysis of network logs, updating and adjusting network state accordingly |
| How many calls have been made so far? | Instant SMSs for remote monitoring |

Table 2: Usefulness of network management module

Question 1: Is the newly joined node stable? What is the effect of the variations in antenna height? We employed an approach of incremental deployment; thus, after every node addition, we wanted to ensure that all the functions of our network were working as expected. One of the important factors in our network is the stability of network links. For outdoor links, it is well known that a link can experience a dip in signal strength due to destructive interference of direct and reflected rays. We frequently encountered such dips in signal strength while fixing our node locations and antenna heights. In fact, we observed up to 20 dB of difference in signal strength when we varied antenna heights by just 25 cm or node locations by just 10 m [14].

The in-network mechanisms in our management module significantly simplified the planning of network links. The logger component of our management module sent instant updates regarding the state of the neighborhood of a newly added node. This information was sent to the root node and then broadcast to the log-collector. The log-collector streamed this information to the log-listener program. The log-listener program then displayed a two-dimensional table showing the strength and error rate of each link in the network. Through such a series of updates, we could observe the signal strength and error rate of network links, and thus fix appropriate node location and antenna height combinations to enable high quality network links.

We calculate the link error rate as follows. For a given number of infrastructure nodes, the period of broadcasting by each infrastructure node is fixed. Every network node can calculate this period by looking at the schedule, and expects to receive a packet from its neighboring infrastructure node in this period. Thus, for a pre-decided interval of 10 seconds or 20 seconds, a node logs the number of expected and actual packet receptions, the difference between which gives us the link error rate. Using such reports regarding link signal strength and link error rates, we planned our network links carefully. We then observed that 802.15.4 links remain stable in an outdoor environment, with a standard deviation in signal strength of about 2-3 dB and packet loss of about 1-2%.
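The calculation above can be sketched in a few lines. The function names and the 0.5-second broadcast period are hypothetical, and expressing the result as the missed fraction of expected receptions is our reading of "difference" for the purpose of this example.

```python
def expected_receptions(interval_s: float, broadcast_period_s: float) -> int:
    """Number of broadcasts a neighbor should send within the logging interval."""
    return int(interval_s // broadcast_period_s)


def link_error_rate(expected: int, actual: int) -> float:
    """Fraction of expected packets that were not received."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    missed = max(expected - actual, 0)
    return missed / expected


# Hypothetical numbers: 10 s logging interval, one broadcast every 0.5 s.
expected = expected_receptions(interval_s=10, broadcast_period_s=0.5)
rate = link_error_rate(expected, actual=19)
```

With these numbers a node expects 20 receptions per interval, so one missed packet corresponds to an error rate of 5%.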

Question 2: What is the node-join latency of the newly added node? How many nodes have joined so far? What are the join statistics? To answer these questions while in the field, we interacted with our log-listener program and fired queries to the corresponding nodes to fetch the network logs. The logger component running on each node responded to such a query and sent the logs to the log-listener program. Our log-listener program stored and constantly updated the network state locally. We could thus interact with our log-listener program to get answers to these questions. The statistics of nodes joined in the network consisted of (1) the time of the join for each node, (2) the parent of a network node, (3) the time at which a node rejoined, and (4) the signal strength and error rate of the link to its parent. Such information especially helped us to ensure that our incremental deployment was progressing well. During our test deployment in the village, over the period of 3 days, we observed only 4 instances of rejoins. Furthermore, we could attribute the cause of these rejoins to bad link quality.

Question 3: What is the routing path of the newly accepted voice call? What is the packet loss percentage experienced by a recently terminated call? Whenever a voice call is accepted, the schedule state at the root node changes, and we receive such state changes at the log-collector through live broadcasts by the root node. Thus, during our deployment, our log-collector constantly received such updates and displayed the change in schedule state with the help of the log-listener program. This change in schedule state contained the {slot, channel} allocation to each link on the routing path.

Further, the call logs received from the source and destination nodes of a call contained the number of expected and actual packet receptions. In our implementation, a scheduling period is the same as the period of the TDMA frame; thus, we can calculate the number of expected packet receptions as the number of frames elapsed between the start and the end of the call. Using the expected and actual receptions, the log-listener program calculated and displayed the packet loss percentage. Using such statistics, for a few voice calls, we could also attribute a temporary dip in voice quality to a burst of packet losses.
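As a worked example of this computation, the sketch below derives the expected receptions from the call duration and the frame period, and then the loss percentage. The 60 ms frame period, the call length, and the function names are hypothetical; times are kept in integer milliseconds to avoid floating-point rounding when counting frames.

```python
def expected_frames(call_start_ms: int, call_end_ms: int, frame_period_ms: int) -> int:
    """Frames elapsed during the call, i.e. the expected packet receptions."""
    return (call_end_ms - call_start_ms) // frame_period_ms


def packet_loss_percent(expected: int, received: int) -> float:
    """Percentage of expected packets that never arrived."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    lost = max(expected - received, 0)
    return 100.0 * lost / expected


# Hypothetical call: 60 s long, one voice packet per 60 ms TDMA frame.
frames = expected_frames(call_start_ms=0, call_end_ms=60_000, frame_period_ms=60)
loss = packet_loss_percent(frames, received=986)
```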

Thus, during our deployment, we enabled several local voice calls consisting of one to three hops in the village. We also enabled several remote voice calls consisting of two hops through our gateway node. The voice calls in our network experienced a mean packet loss of 1.4% with a standard deviation of 0.8%.

Figure 9: Lo3 infrastructure node

Figure 10: Lo3 in use

Figure 11: Call statistics of a Lo3 user through the gateway node during deployment at Ahupe village (x-axis: dates from April-12 to July-24; y-axis: number of calls, 0 to 6; series: dialed calls and received calls). Dialed calls are calls originated from the village, while received calls are calls originated outside the village.

Question 4: How many calls have been made so far? What are the call statistics? As we have mentioned earlier, the log-listener program maintains a local copy of the network state and builds a history of network events through our network management module. Thus, during our deployment, we could observe the number of calls made by the villagers over a day, along with call statistics which include (1) the source and destination nodes of a call, (2) the corresponding routing path, (3) the duration of the call, and (4) packet loss statistics. Over the period of 3 days during our test deployment, the villagers established more than 8 local voice calls and more than 6 remote voice calls. We calculated the Mean Opinion Score (MOS) using the codec rate, the delay and the packet loss rate of the call; almost all calls reported a MOS greater than 3.6, which is considered good in practice. Using our remote-updater module, we have extracted the number of calls made by the villagers over 30 days. The plot in Fig. 11 shows the number of calls dialed and received by a Lo3 user.
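As an illustration of how a MOS can be estimated from delay and packet loss, the sketch below uses a simplified form of the ITU-T G.107 E-model with a commonly used approximation for the delay impairment. This is an assumption for illustration, not necessarily the computation used in our deployment. The codec-dependent parameters `ie` (equipment impairment) and `bpl` (packet-loss robustness) are placeholder values, not measured figures for Speex, and the delay and loss inputs are hypothetical.

```python
def r_factor(one_way_delay_ms: float, loss_percent: float, ie: float, bpl: float) -> float:
    """Simplified E-model R-factor; all other impairments left at defaults."""
    d = one_way_delay_ms
    # Delay impairment: linear term plus a penalty beyond 177.3 ms.
    i_d = 0.024 * d + (0.11 * (d - 177.3) if d > 177.3 else 0.0)
    # Effective equipment impairment under random packet loss.
    ie_eff = ie + (95.0 - ie) * loss_percent / (loss_percent + bpl)
    # 93.2 is the R value obtained with default E-model parameters.
    return 93.2 - i_d - ie_eff


def mos_from_r(r: float) -> float:
    """Standard E-model mapping from R-factor to MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)


# Placeholder codec parameters and hypothetical call conditions.
mos_low_loss = mos_from_r(r_factor(100.0, 1.4, ie=11.0, bpl=19.0))
mos_high_loss = mos_from_r(r_factor(100.0, 10.0, ie=11.0, bpl=19.0))
```

Under this model the MOS falls as either the delay or the loss rate grows, which is consistent with attributing dips in voice quality to bursts of packet loss.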

Question 5: How many nodes or links have failed so far? How many of them recovered? Our log-listener analyzed the logs by calculating the difference between successive network states. For example, given an update to the network state, the log-listener checked the local network state before incorporating the update. Thus, through periodic reception and analysis of network logs, we could detect the failure of a link or a node. During our test deployment, we observed only one instance of radio failure, and our watchdog-timer based fault-tolerance mechanism kicked in to recover the node within about 2 minutes. We did not observe any root node failure. The villagers used our network for more than a month for real-time voice calls, and our fault-tolerance mechanisms turned out to be quite useful in sustaining the network deployment over this period.
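The state comparison performed by the log-listener can be sketched as a set difference between two snapshots. The snapshot layout (a dictionary with `nodes` and `links` sets) and the node names are hypothetical.

```python
def diff_state(old: dict, new: dict) -> dict:
    """Compare two snapshots {'nodes': set, 'links': set} and report changes."""
    return {
        "failed_nodes": old["nodes"] - new["nodes"],
        "added_nodes": new["nodes"] - old["nodes"],
        "failed_links": old["links"] - new["links"],
        "added_links": new["links"] - old["links"],
    }


# Hypothetical snapshots before and after an update from the root.
before = {"nodes": {"n1", "n2", "n3"}, "links": {("n1", "n2"), ("n2", "n3")}}
after = {"nodes": {"n1", "n2"}, "links": {("n1", "n2")}}
changes = diff_state(before, after)
```

A node that appears under `failed_nodes` in one update and under `added_nodes` in a later one can be counted as recovered.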

However, as one can observe in Fig. 11, our system went down several times. From April 22 to May 6, our system was down due to the failure of the GSM modem, and we had to replace the modem through a physical visit. From May 14 to May 26, the battery of our gateway node failed and needed replacement. From June 10 to June 31, the villagers packed the gateway away indoors to protect it from heavy rains. We then prepared and installed a small shed for our gateway to restart its operation. We also upgraded the power kit of our gateway with a low cost (USD 6) solar panel (3 W) that automatically recharges the gateway battery, which is especially useful during the monsoon season. After operating wonderfully for 3 weeks, on July 24, the power kit of our gateway node was stolen. At the time of writing, we are planning a physical visit to repair the gateway and increase the scale of our deployment in Ahupe village with a set of enhancements to the Lo3 handset.

6. DEPLOYMENT EXPERIENCES

Power requirement. Over the 2 days of network planning, Ahupe village received electricity for only 2-3 hours, and our laptop batteries (we use laptops for programming Lo3 nodes and analyzing logs) were drained by the end of the 2nd day. From this experience, we think that managing the power source is a crucial factor in a rural deployment, given the abysmal electricity supply in rural areas. In fact, the authors in [7] point out that the real cost of a network includes not only the grid power but also the cost required to overcome power quality problems. In this context, Lo3 operated remarkably well: Lo3 infrastructure nodes worked on AA batteries (completely “off-the-grid”) during this period and continued to do so for a period of one month. In fact, during our deployment, Ahupe village witnessed incessant rains for several days, and our network operated quite well without any need for battery replacement or recharge. This confirmed our premise that 802.15.4 is indeed a better choice than 802.11 in terms of sustaining a network deployment in hostile environments with poor electricity supply.

Security of network nodes. After a successful run of a week, our system became a hit among the nearby villagers due to word-of-mouth publicity. As a consequence, we received a notification from the Department of Forest, Maharashtra, India, asking us to apply for a permit if we were to operate a cellular base station. We convinced the officials that our system does not include any base station components. Since several components in our gateway node, such as the rechargeable battery or the GSM modem, have resale value, we were concerned about the safety of our gateway node. The gateway crash incident mentioned in the previous section highlighted the importance of security measures. For our future deployments, we are planning to lock our gateway node in a small compartment to protect it from such damage.

7. CONCLUSIONS

In this work, we presented the design and use of various network management and fault-tolerance mechanisms for a centrally controlled wireless mesh network. The in-network mechanisms of our system reported live network updates during deployment, which eased our deployment activities. We also implemented a set of low overhead network management mechanisms which helped us collect statistics regarding network performance without affecting the primary applications. Further, the fault-tolerance mechanisms provided the necessary resilience to our system against network node and link failures. With this set of mechanisms, we deployed our mesh network in a village in India, and our network enabled more than a hundred voice calls over a period of one month. The villagers gave a very positive response to our system. We hope that our experiences of network planning, in-network data collection and network sustenance will be useful for future systems targeted at rural deployment.

8. ACKNOWLEDGMENTS

We would like to thank Mr. Anand Kapoor and Mr. Sanjay Asavale at Ahupe for their help during our deployment, and CouthIT, Hyderabad, India, for providing us with an evaluation copy of Speex on the C5515 platform.

9. REFERENCES

[1] L. Subramanian, S. Surana, R. Patra, S. Nedevschi, M. Ho, E. Brewer, and A. Sheth, “Rethinking Wireless in the Developing World,” in Hot Topics in Networks (HotNets-V), November 2006.

[2] “IEEE 802.11, The Working Group for Wireless LANs,” http://grouper.ieee.org/groups/802/11/.

[3] K. Chebrolu and B. Raman, “FRACTEL: A Fresh Perspective on (Rural) Mesh Networks,” in NSDR, 2007.

[4] R. Patra, S. Nedevschi, S. Surana, A. Sheth, L. Subramanian, and E. Brewer, “WiLDNet: Design and Implementation of High Performance WiFi Based Long Distance Networks,” in NSDI, 2007.

[5] “Project Ashwini,” http://www.nisg.org/docs/project_ashwini.pdf.

[6] “Village Telco,” http://www.villagetelco.org/.

[7] S. Surana, R. Patra, S. Nedevschi, M. Ramos, L. Subramanian, Y. Ben-David, and E. Brewer, “Beyond Pilots: Keeping Rural Wireless Networks Alive,” in NSDI, 2008.

[8] B. Raman and K. Chebrolu, “Lo3: Low-cost, Low-power, Local Voice and Messaging for Developing Regions,” in 3rd ACM Workshop on Networked Systems for Developing Regions (NSDR), 2009.

[9] “IEEE 802.15 WPAN Task Group 4 (TG4),” http://www.ieee802.org/15/pub/TG4.html.

[10] V. Gabale, B. Raman, K. Chebrolu, and P. Kulkarni, “LokVaani: Demonstrating Interactive Voice in Lo3,” in Demo Paper in 16th ACM Conference on Special Interest Group on Communications and Computer Networks (SIGCOMM), New Delhi, India, Aug 2010.

[11] V. Gabale, R. Gopalakrishnan, and B. Raman, “The Pilot Deployment of A Low Cost, Low Power Gateway To Extend Cellular Coverage In Developing Regions,” in 5th ACM Workshop on Networked Systems for Developing Regions (NSDR), Washington D.C., U.S.A., June 2011.

[12] V. Gabale, B. Raman, K. Chebrolu, and P. Kulkarni, “LiT MAC: Addressing The Challenges of Effective Voice Communication in a Low Cost, Low Power Wireless Mesh Network,” in 1st ACM Symposium on Computing for Development (DEV), London, U.K., Dec 2010.

[13] “TelosB Mote Platform,” http://www.willow.co.uk/html/telosb_mote_platform.html.

[14] V. Gabale, J. Patani, R. Mehta, R. Kalyanaraman, and B. Raman, “Building A Low Cost Low Power Wireless Network To Enable Voice Communication In Developing Regions,” in IITB/CSE/2012/September/48, TR-CSE-2012-48 (submitted to MC2R).

[15] A. Dekhne, N. Uchat, and B. Raman, “Implementation and Evaluation of a TDMA MAC for WiFi-based Rural Mesh Networks,” in NSDR, 2010.

[16] Y. Cheng, J. Bellardo, P. Benko, A. C. Snoeren, G. M. Voelker, and S. Savage, “Jigsaw: Solving the Puzzle of Enterprise 802.11 Analysis,” in SIGCOMM, Sep 2006.

[17] A. Sheth, C. Doerr, D. Grunwald, R. Han, and D. Sicker, “Mojo: a Distributed Physical Layer Anomaly Detection System for 802.11 WLANs,” in MOBISYS’06.

[18] N. Ramanathan, K. Chang, R. Kapur, L. Girod, E. Kohler, and D. Estrin, “Sympathy for the sensor network debugger,” in SenSys, 2005.

[19] W. L. Lee, A. Datta, and R. Cardell-Oliver, “WinMS: Wireless Sensor Network-Management System, An Adaptive Policy-Based Management for Wireless Sensor Networks,” 2006.

[20] F. Mauro, G., and S. Carla, “Decentralized Fault Diagnosis for Sensor Networks,” in CASE’09.

[21] J. G. Kemeny, J. L. Snell, and G. L. Thompson, “Ch 5.8, Absorbing Markov Chains,” Introduction to Finite Mathematics, Prentice-Hall, pp. 282–291, 1966.

[22] “Scilab,” http://www.scilab.in/.

[23] E. Brewer, “Towards Robust Distributed Systems,” in Principles of Distributed Computing, ser. PODC-2000, 2000.
