

Time Synchronization for Multi-Modal Target Tracking in Heterogeneous Sensor Networks

Isaac Amundson, Manish Kushwaha, Branislav Kusy, Peter Volgyesi, Gyula Simon†, Xenofon Koutsoukos, Akos Ledeczi

Institute for Software Integrated Systems (ISIS)
Department of Electrical Engineering and Computer Science
Vanderbilt University
Nashville, TN 37235, USA
Email: [email protected]

†Department of Computer Science
University of Pannonia
Veszprem, Hungary

Abstract— Heterogeneous sensor networks consisting of resource-constrained nodes as well as resource-intensive nodes equipped with high-bandwidth sensors offer significant advantages for developing large sensor networks for a diverse set of applications. Target tracking can benefit from such heterogeneous networks that support the use of sensors with different modalities. Such applications require tight time synchronization across the heterogeneous sensor network in order to improve both the estimation and real-time performance. In this paper we present a methodology for time synchronization in heterogeneous sensor networks. The synchronization methodology has been implemented as a network service and tested on an experimental testbed, demonstrating accuracy on the order of microseconds over a multi-hop network. In addition, we use the time synchronization method in a multi-modal tracking application to perform accurate sensor fusion of audio and video data collected from heterogeneous sensor nodes, and we show that our method improves tracking performance.

I. INTRODUCTION

Wireless sensor networks (WSNs) consist of large numbers of cooperating sensor nodes and have already demonstrated their utility in a wide range of applications including environmental monitoring, transportation, and health care. These types of applications collect sample data by monitoring the environment. In order to make sense of the collected samples, nodes usually send the data through the network to a sensor-fusion node where it can be analyzed. We call this process reactive data fusion. One important aspect of reactive data fusion is the need for a common notion of time among participating nodes.

In this paper, we consider multi-modal tracking as a reactive data fusion process in which several nodes cooperate to estimate the position of a target. Target tracking using multiple sensor modalities is considered to be more robust, more accurate, more compact, and to yield more information than using a single sensor modality [1]. A typical multi-modal tracking system consists of heterogeneous sensor nodes with different modalities, such as audio and video, sensing their local environment and communicating with each other and the sensor fusion node. The local measurements are combined at the sensor fusion node to estimate the position and other characteristics of the target. A target tracking application requires tight synchronization across the heterogeneous sensor network in order to improve both the estimation and real-time performance.

Heterogeneous sensor networks (HSNs) consist of resource-constrained nodes as well as resource-intensive nodes equipped with high-bandwidth sensors such as cameras, and are a promising direction for developing large sensor networks for a diverse set of applications [2], [3], [4]. Accurately synchronizing the clocks of all sensor nodes within a heterogeneous network is not a trivial task. Existing WSN time synchronization protocols (e.g., [5], [6], [7], [8], [9]) perform well on the devices for which they were designed. However, these have not been applied to networks of heterogeneous devices.

Our work focuses on achieving microsecond-accuracy synchronization in HSNs. We consider a multi-hop network consisting of Berkeley motes and Linux PCs, a dominant configuration for reactive data fusion applications in HSNs. Mote networks are capable of monitoring environmental phenomena with small energy consumption, which allows for extended lifetime. PCs can support higher-bandwidth sensors, such as cameras, and can run intensive processing algorithms on the collected data before routing it to the sensor-fusion node. Time synchronization in both mote and PC networks has been studied independently; however, to the best of our knowledge, a sub-millisecond software synchronization method combining these two networks has not yet been developed.

The contribution of the paper is twofold. First, we developed a technique for synchronization between a mote and a PC implemented completely in software. The technique enables the integration of existing protocols for synchronization of mote and PC networks. We have implemented a time synchronization service for reactive data fusion applications based on this methodology, and our experimental results show that we can achieve synchronization accuracy on the order of tens of microseconds. Second, we use the time synchronization method in a multi-modal tracking application. The time synchronization service allows us to perform accurate sensor fusion of audio and video data collected from heterogeneous sensor nodes. Our experimental results show that using a heterogeneous sensor network can improve tracking performance.

Time synchronization in sensor networks has been studied extensively in the literature and several protocols have been proposed. Reference Broadcast Synchronization (RBS) [6] is a protocol for synchronizing a set of receivers to the arrival of a reference beacon. Timestamp Synchronization (TSS) [5] was one of the first synchronization protocols to focus directly on WSNs, proposing the novel concept of time synchronization on-demand. Elapsed Time on Arrival (ETA) [10] provides a set of application programming interfaces for an abstract time synchronization service. RITS [9] is an extension of ETA over multiple hops. It incorporates a set of routing services to efficiently pass sensor data to a network sink node for data fusion. In [11], mote-PC synchronization was achieved by connecting the GPIO ports of a mote and an iPAQ PDA using the mote-NIC model. Although this technique can achieve microsecond-accurate synchronization, it was implemented as a hardware modification rather than in software.

Multi-modal tracking using audio and video sensors has been studied mainly for videoconferencing applications. Pingali et al. [1] describe audio-visual tracking to find a speaker among a group of individuals. They also argue that a joint AV tracking system is more desirable than either single modality. Yoshimi [12] presents a distributed multi-modal tracking system using multiple cameras and microphones to select a single speaker during a meeting. Zotkin [13] presents simulation results of a multi-modal tracking system for smart videoconferencing.

The rest of the paper is organized as follows. Section II describes the time synchronization problem. In Section III, we present our methodology, implementation, and experimental results for time synchronization in heterogeneous sensor networks. Section IV outlines our multi-modal tracking system and presents some experimental results. Section V concludes.

II. TIME SYNCHRONIZATION

In this section, we formulate the problem using a system model for time synchronization and we describe the sources of synchronization error.

A. System Model

Each sensor node in a WSN is a computing device that maintains its own local clock. Internally, the clock is a piece of circuitry that counts oscillations of a quartz crystal energized at a specific frequency. When a certain number of these oscillations occur, a clock-tick counter is incremented. This counter is accessible from the operating system, and its accuracy (with respect to atomic time) depends on the quality of the crystal as well as various operating conditions such as temperature, pressure, humidity, and supply voltage. When a sensor node registers an event of interest, it accesses the clock-tick counter and records a timestamp reflecting the time at which the event occurred.

We use the notation t to represent real (or atomic) time, and the notation te to represent the time at which an arbitrary event e occurred. Because each node records a timestamp according to its own clock, we specify the local time on node Ni at which event e was detected by the timestamp Ni(te). Although the two timestamps Ni(te) and Nj(te) correspond to the same real-time instant te, this does not imply that Ni(te) = Nj(te); the clock-tick counter on node Ni may still be offset from Nj. Therefore, from the perspective of node Ni, we define the clock offset with Nj at real-time t as

φij(t) = Ni(t) − Nj(t).

It may be the case that the offset changes over time. In other words, the clock rate of node Ni, dNi(t)/dt, may not equal the ideal rate (dNi(t)/dt = 1). We define the ratio between the clock rates of the two nodes as the relative rate,

rrij = dNi(t) / dNj(t).

The relative rate is a quantity directly related to the clock skew, which is defined as the difference between clock rates, and is used in our clock skew compensation methods. We refer to a node's clock offset and clock rate characteristics as its timescale.

Knowledge of the clock offset and skew can be used to synchronize a pair of nodes. In this work, we use the timescale transformation approach. In this approach, reference tables are maintained to keep track of the clock offsets and skew between a node and its neighbors. Each reference table is then used to transform a clock value from one timescale to another, providing each node with a common notion of time [14]. In sensor networks, the timescale transformation approach is preferable to physically changing the value of a node's clock-tick counter to match that of another node [14].

B. Sources of Synchronization Error

Synchronization requires passing timestamped messages between nodes. However, this communication has associated message delay, which has both deterministic and nondeterministic components that introduce error into the timescale transformation. We call the sequence of steps involved in communication between a pair of nodes the critical path. Figure 1 illustrates the critical path in a wireless connection. The critical path is not identical for all configurations; however, it can typically be characterized by the steps outlined in the figure (for more details, see for example [7], [6], [8]).

Fig. 1. Critical path

In both the mote and PC networks, the Send and Receive times are the delays incurred as the message is passed between the application and the MAC layer. These segments are mostly nondeterministic due to system call overhead and kernel processing, and the delay can reach hundreds of milliseconds. The Access time is the least deterministic segment in the critical path. It is the delay that occurs in the MAC layer while waiting for access to the communication channel. Depending on network congestion, access time can take longer than one second. The Transmission and Reception times are the delays incurred from transmitting and receiving a message over the physical layer, bit by bit. These can be as high as tens of milliseconds; however, they are mostly deterministic and depend on bit rate and message length. The Propagation time is the time the message takes to travel between the sender and receiver. This delay is highly deterministic and also the least significant; a message can travel up to 300 meters in less than a microsecond.

The mote-PC serial connection is slightly different. The Send and Receive times are similar to wireless communication; however, the Access time is noticeably different. Unlike wireless networks, in which nodes must compete with a potentially large number of neighbors for access to the channel, a serial connection is shared by only two devices. Furthermore, in duplex mode, simultaneous bidirectional communication is possible, so Access time is not affected by congestion. However, there will be a delay if the receiver is not ready to accept data. To discover whether data can be sent, the sender sets the Request-To-Send (RTS) pin high on the serial port. If the receiver is able to accept data, it sets the Clear-To-Send (CTS) pin high, and data transfer can begin. Thus, the Access time is the delay from the time the RTS pin is set by the sender to the time the CTS pin is set by the receiver. The Transmission time starts when the data on the sender is moved in 16-byte chunks to the buffer on the UART, and ends after the data is transmitted bit-by-bit across the serial port to the receiver. As in wireless networks, the Propagation time is minimal. The UART on the receiver places the received bits into its own 16-byte buffer. When the buffer is almost full, or a timeout occurs, it sends an interrupt to the CPU, which notifies the serial driver, and the data is transferred to main memory, where it is packaged and forwarded to the user-space application.

In addition to the error from message delay nondeterminism, synchronization accuracy is also affected by clock skew when a pair of nodes operate for extended periods of time without correcting their offset. For example, suppose at real-time t, nodes N1 and N2 exchange their current clock values at local times N1(t) and N2(t), respectively. At some later time, an event e occurs that is detected and timestamped by both nodes, and N1 sends its timestamp N1(te) to N2. If the clock rates on each node were equal, N2 would simply be able to take the previously calculated offset and use it to transform the event timestamp N1(te) to the corresponding local time N2(N1(te)) = N1(te) + φ12(t). However, because of the clock skew, an attempt to convert N1(te) to the local timescale can potentially result in an error, especially if the interval between the last synchronization and the event was large.

Given the estimated clock offset and relative rate, a timescale transformation, which converts an event timestamp Nj(te) from the timescale of node Nj to the timescale of Ni, can be defined as

Ni(Nj(te)) = Ni(ts) + rrij(ts) [Nj(te) − Nj(ts)],

where Ni(ts) and Nj(ts) are the respective local times at which nodes Ni and Nj exchanged their most recent synchronization message, s.
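Once a node has stored its most recent synchronization pair and relative-rate estimate for a neighbor, the transformation above amounts to a one-line computation. The following Python sketch illustrates it; the class and function names are ours, not taken from the implementation:

```python
from dataclasses import dataclass

@dataclass
class ReferenceEntry:
    """One per-neighbor row of the reference table."""
    sync_local: float   # N_i(t_s): local time of the last sync exchange
    sync_remote: float  # N_j(t_s): neighbor's time of the same exchange
    rr: float           # rr_ij: estimated relative clock rate

def to_local_timescale(entry: ReferenceEntry, remote_event_ts: float) -> float:
    """N_i(N_j(t_e)) = N_i(t_s) + rr_ij * (N_j(t_e) - N_j(t_s))."""
    return entry.sync_local + entry.rr * (remote_event_ts - entry.sync_remote)
```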

III. TIME SYNCHRONIZATION IN HSNS

This section presents our methodology, implementation, and experimental results for time synchronization in heterogeneous sensor networks.

A. Synchronization Techniques

Our goal is to provide accurate time synchronization to reactive data fusion applications in HSNs. Doing so requires an understanding of how the various components in the network interact. We refer to the combination of these components as a configuration. Our configuration consists of Mica2 motes and Linux PCs. There are two physical networks: the B-MAC network formed by the motes and the 802.11 network containing the PCs. The link between the two is achieved by connecting a mote to a PC with a serial connection. This mote-PC configuration was chosen because it is representative of HSNs containing resource-constrained sensor nodes for monitoring the environment and resource-intensive computational nodes.

Mote Network: Existing protocols that minimize synchronization error can be classified as sender-receiver, in which one node synchronizes with another, or receiver-receiver, in which multiple nodes synchronize to a common event [15]. Several sender-receiver synchronization protocols [7], [8], [9] have been developed for the Berkeley motes and similar small-scale devices, and provide microsecond accuracy. This is possible because these embedded devices often support operating systems that are tightly integrated with the radio stack, enabling timestamping directly at the network interface and thus bypassing the majority of nondeterministic message delay in the critical path. For example, in RITS, sender-side synchronization error is essentially eliminated by taking the timestamp and inserting it into the message after the message has already begun transmission. On the receiver side, a timestamp is taken upon message reception, and the difference between these two timestamps estimates the clock offset. RITS was shown to synchronize a pair of nodes with an average error of 1.48 µs and an accumulated error of 0.5 µs per hop [9]. The accuracy of a receiver-receiver synchronization protocol such as RBS (discussed below) is comparable to sender-receiver synchronization in a mote network. However, it has greater associated communication overhead, which can shorten the lifetime of the network. Therefore, we selected RITS to synchronize the mote network in our HSN.

Page 4: Time Synchronization for Multi-Modal Target Tracking in Heterogeneous Sensor Networks

PC Network: Synchronization of PC networks has been studied extensively over the past four decades; however, popular sender-receiver algorithms such as the Network Time Protocol (NTP) [16] only provide millisecond accuracy. This is generally acceptable because PC users typically do not require greater synchronization precision for their applications. RBS is a popular receiver-receiver protocol that was found to achieve synchronization accuracy on the order of microseconds over 802.11 networks. Rather than synchronizing clocks to each other, participating nodes timestamp the arrival of a special message broadcast over the network. By exchanging these timestamps, neighboring nodes are able to maintain reference tables for timescale transformation. Receiver-receiver synchronization minimizes error by taking advantage of the broadcast channel found in most networks. Messages broadcast at the physical layer arrive at a set of receivers within a tight time bound due to the almost negligible propagation time of sending an electromagnetic signal through air. This removes the send and access time nondeterminism from the critical path. The only remaining significant source of nondeterminism is the receive time, the majority of which can be removed by timestamping the arrival of the reference beacon in the kernel interrupt function. Because receiver-receiver synchronization provides greater accuracy in 802.11 PC networks, we selected RBS as the synchronization mechanism for our Linux PCs.

Mote-PC Connection: There are two ways the interface between the mote and PC networks can be modeled. In the first, the gateway mote acts as a network interface controller (NIC) for the PC it is connected to. This generally implies there is no separate critical path between the gateway mote and PC; instead, a single critical path exists between the mote network and the mote-PC pair. The second model places the critical path between the gateway mote and PC. We chose the latter because it gives us greater control over minimizing the nondeterministic message delay. We selected Elapsed Time on Arrival (ETA) [10], the underlying synchronization mechanism used in RITS. A timestamp is taken by the mote upon successful transmission of a synchronization byte over the serial connection and appended to the end of the outgoing message; this places the mote timestamp at the latest possible point in the critical path. Simultaneously, as the synchronization byte is received at the PC, a timestamp is taken in the serial driver at the kernel level, immediately after the UART issues its interrupt, the earliest possible point. The PC regards the difference between these two timestamps as the PC-mote offset, φpc,mote. The serial communication bit rate between the mote and PC is 57600 baud, which amounts to a transfer time of approximately 139 microseconds per byte. However, the UART does not issue an interrupt to the CPU until its 16-byte buffer nears capacity or a timeout occurs. Because the synchronization message is six bytes, reception time in this case consists of the transfer time of the entire message, the timeout time, and the time it takes the CPU to transfer the data from the UART buffer into main memory. This time is compensated for by the receiver, and the clock offset between the two devices is determined as the difference between the PC receive time and the mote transmit time.
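The offset computation just described can be sketched as follows. The constants mirror the figures quoted above (57600 baud, roughly 139 µs per byte, a six-byte message), while the timeout and buffer-copy terms are placeholders for values that would be measured on the actual hardware:

```python
BAUD = 57600
BYTE_TIME_US = 8 * 1e6 / BAUD   # ~139 us per byte, as quoted above
SYNC_MSG_BYTES = 6

def mote_pc_offset(pc_rx_us: float, mote_tx_us: float,
                   uart_timeout_us: float, copy_delay_us: float) -> float:
    """Estimate phi_pc,mote = PC clock - mote clock.

    pc_rx_us:   kernel timestamp taken when the UART interrupt is serviced
    mote_tx_us: mote timestamp taken as the sync byte left the serial port
    The deterministic reception delay (whole-message transfer time plus the
    UART timeout and the buffer-to-memory copy) is compensated for before
    differencing the two timestamps.
    """
    reception_delay = SYNC_MSG_BYTES * BYTE_TIME_US + uart_timeout_us + copy_delay_us
    return (pc_rx_us - reception_delay) - mote_tx_us
```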

B. Clock Skew Compensation

Independent of the synchronization protocol, there are several options for clock skew compensation. In some applications, event detection and synchronization always occur so close together in time that clock skew compensation is unnecessary. However, when it does become necessary, nodes must exchange timestamps periodically to ensure their clocks do not drift too far apart. In resource-constrained WSNs, this may be undesirable because it can result in high message overhead. To keep message overhead at a minimum, nodes can exchange synchronization messages less frequently and instead maintain a history of their neighbors' timestamps. Statistical techniques can then be used to produce an accurate estimate of clock offset at any time instant.

Linear Regression: A linear regression fits a line to a set of data points such that the square of the error between the line and each data point is minimized overall. By maintaining a history of n local-remote timestamp pairs {Ni(t1), Nj(t1)}, {Ni(t2), Nj(t2)}, ..., {Ni(tn), Nj(tn)}, node Ni can derive a linear relation Ni(t) = α + βNj(t) and solve for the coefficients α and β. Here, β represents an estimate of rrij(tk), and can be calculated by

β(tk) = Σ_{k=1..n} (Ni(tk) − N̄i)(Nj(tk) − N̄j) / Σ_{k=1..n} (Nj(tk) − N̄j)²,

where N̄i denotes the average of {Ni(t1), ..., Ni(tn)}, and N̄j similarly. One disadvantage of linear regression is that outliers have a strong impact on the result, because data points are weighted by the square of their error to the fitted line. In some instances, however, these outliers can be removed beforehand. Another problem arises when attempting to improve the quality of the regression by increasing the number of data points. This can result in high memory overhead, especially in dense networks. However, it has been shown in [8] that sub-microsecond clock skew error can be achieved with as few as six timestamps in mote networks.
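A direct transcription of this estimator is given below. It assumes equal-length histories of local and remote timestamps and returns the slope β as the relative-rate estimate; this is an illustrative sketch, not the implementation used in our service:

```python
def relative_rate_lr(local_hist: list[float], remote_hist: list[float]) -> float:
    """Least-squares slope beta of N_i(t) = alpha + beta * N_j(t).

    local_hist:  [N_i(t_1), ..., N_i(t_n)] timestamps of node i
    remote_hist: [N_j(t_1), ..., N_j(t_n)] matching timestamps of node j
    """
    n = len(local_hist)
    mean_i = sum(local_hist) / n
    mean_j = sum(remote_hist) / n
    num = sum((ni - mean_i) * (nj - mean_j)
              for ni, nj in zip(local_hist, remote_hist))
    den = sum((nj - mean_j) ** 2 for nj in remote_hist)
    return num / den
```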

Exponential Averaging: Exponential averaging solves the problem of high memory overhead by keeping track of only the current relative rate and the most recent neighbor-local synchronization timestamp pair. When a new timestamp pair is available, the relative rate is updated by

rrij(tk) = α · [Ni(tk) − Ni(tk−1)] / [Nj(tk) − Nj(tk−1)] + (1 − α) · rrij(tk−1),

where α is a small weight constant (typically close to 0.1) that gives higher significance to the accumulated average. The value of α affects the convergence time. In addition, because each relative rate estimate is partially derived from its previous value, there will be a longer convergence time before an accurate estimate is reached. This can be reduced by providing the algorithm with an initial relative rate, determined experimentally.
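The update rule translates directly into a constant-memory estimator; only the previous timestamp pair and the running estimate are retained. A minimal sketch, with illustrative names:

```python
class ExpAvgRate:
    """Constant-memory relative-rate estimator using exponential averaging."""
    def __init__(self, rr_init: float = 1.0, alpha: float = 0.1):
        self.rr = rr_init      # seed with an experimentally determined rate
        self.alpha = alpha     # small weight: favors the accumulated average
        self.prev = None       # most recent (N_i, N_j) timestamp pair

    def update(self, ni: float, nj: float) -> float:
        if self.prev is not None:
            ni_prev, nj_prev = self.prev
            sample = (ni - ni_prev) / (nj - nj_prev)
            self.rr = self.alpha * sample + (1 - self.alpha) * self.rr
        self.prev = (ni, nj)
        return self.rr
```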

Phase-Locked Loop: The phase-locked loop (PLL) is a mechanism for clock skew compensation used in NTP [17]. The PLL compares the ratio of a current local-remote timestamp pair with the current estimate of the relative rate. The PLL then adjusts the estimate by the sum of a value proportional to the difference and a value proportional to the integral of the difference. PLLs generally have a longer convergence time than linear regression and exponential averaging, but have low memory overhead. A diagram of our PLL implementation is illustrated in figure 2. There are three main components. The Phase Detector calculates the relative rate between two nodes and compares this with the output of the PLL, which is the previous estimate of the relative rate. The difference between these two values is the phase error, which is passed to the second-order Digital Loop Filter. Because we expect there to be some amount of phase error, we choose a filter with an integrator, which allows the PLL to compensate such that no steady-state phase error remains. To implement this behavior in software, a digital accumulator is used, represented by y(t) = (K1 + K2)u(t) − 10K2K1u(t−1) + 10K2y(t−1). The resulting static phase error is passed to the Digitally Controlled Oscillator (DCO). The DCO sums the previous phase error with the previous output, which produces the current estimate of the relative clock rate, and this is fed back into the PLL. Techniques for selecting the gains are presented in [17].

Fig. 2. Phase-locked loop
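For comparison with the previous two estimators, a simplified PLL-style estimator is sketched below. It follows the proportional-plus-integral behavior described above rather than the exact second-order filter of figure 2, so it should be read as an approximation; the gains are those used in our experiments (Section III-D):

```python
class SkewPLL:
    """Proportional-plus-integral approximation of the PLL of figure 2.

    The phase detector compares the newly measured relative rate with the
    previous estimate; the loop filter applies a proportional term plus an
    integral term (the accumulator that drives steady-state error to zero).
    """
    def __init__(self, rr_init: float = 1.0, k1: float = 0.1, k2: float = 0.09):
        self.rr = rr_init
        self.k1, self.k2 = k1, k2
        self.integral = 0.0
        self.prev = None   # most recent (N_i, N_j) timestamp pair

    def update(self, ni: float, nj: float) -> float:
        if self.prev is not None:
            ni_prev, nj_prev = self.prev
            measured = (ni - ni_prev) / (nj - nj_prev)   # phase detector
            err = measured - self.rr                     # phase error
            self.integral += err
            self.rr += self.k1 * err + self.k2 * self.integral  # filter + DCO
        self.prev = (ni, nj)
        return self.rr
```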

C. Experimental Setup

Our testbed consists of seven Crossbow Mica2 motes and four stationary ActiveMedia Pioneer robots with embedded Linux PCs, as illustrated in figure 3. In addition, we employ an OpenBrick-E Linux server to transmit RBS beacons.

A PC-based reference broadcast node transmits a reference beacon containing a sequence number once every ten seconds. The arrival of these messages is timestamped in the kernel and stored in a reference table. Simultaneously, a designated mote broadcasts event beacons once every four seconds; six hundred event beacons are broadcast per experiment. The motes timestamp the arrival of an event beacon and place the timestamp into a message. The message is routed over three hops in the mote network to the mote-PC gateways using RITS, which converts the timestamp to the local timescale at each hop.

Fig. 3. Our sensor network testbed. Arrows indicate communication flow.

The message is next transferred from the mote network to the PC network over the mote-PC serial connection, and the event timestamp is again converted to the timescale of the gateway PC. The gateway PCs forward the message two additional hops to the sensor-fusion node.

In the PC network, the event message includes additional fields, specifically the most recent reference broadcast arrival timestamp and sequence number. These allow us to reduce the message overhead of RBS. PCs receiving these messages are able to convert the event timestamp into their local timescale by comparing the sender PC's reference broadcast timestamp with their own, adjusting the event timestamp by the offset, and compensating for clock skew. The experiment was repeated using each of the clock skew compensation techniques described above.

The implementation used in these experiments was bundled into a PC-based time synchronization service for reactive data fusion applications in HSNs. The service accepts timestamped event messages on a specific port, converts the embedded timestamp to the local timescale, and forwards the message toward the sensor-fusion node. The mote implementation uses the TimeStamping interface provided with the TinyOS distribution [18].

D. Experimental Results

Mote-PC Synchronization: To evaluate synchronization accuracy between the mote and the PC, GPIO pins on the mote and PC were connected to an oscilloscope and set high upon timestamping. The resulting output signals were captured and measured. The test was performed over 100 synchronizations, and the resulting error was 7.32 µs on average. The results are displayed in figure 4. The majority of the error is due to nondeterministic message delay resulting from jitter, both in the UART and the CPU.

HSN Synchronization: Figure 5 summarizes the synchronization error for each type of clock skew compensation technique. The results were obtained as follows: two nodes, N1 and N2, simultaneously timestamp the occurrence of an event, such as a reference beacon. These timestamps are then transformed to the timescale of node N3, and the absolute value of their difference, |N3(N1(te)) − N3(N2(te))|, represents the error in the timescale transformation.


Fig. 4. Mote-PC error.

When no clock skew compensation was used, the average difference between the source timestamps was 85.66 µs, with a maximum of 689 µs. Linear regression was implemented with a history size of 8 local-remote timestamp pairs for each neighbor, and the relative rate was initialized to 1. As expected, there was a notable improvement in synchronization accuracy, with an average error of 6.89 µs and a maximum of 55 µs. Exponential averaging gives error similar to linear regression: we chose a value of 0.10 for α and initialized the average relative rate to 1, and the average error recorded was 7.45 µs, with a maximum of 47 µs. The average synchronization error using the phase-locked loop was 7.85 µs, with a maximum of 574 µs; for the digital loop filter, we used gains of K1 = 0.1 and K2 = 0.09.

Fig. 5. HSN synchronization error.

Each experiment was run under both high and low network congestion and I/O load conditions. To simulate a high network load, an extra mote and PC (not pictured in figure 3) were introduced, each broadcasting messages of random size every 100 milliseconds. The results for each type of clock skew compensation in the presence of network congestion and I/O load show that synchronization accuracy is not noticeably affected by these strenuous conditions.

The majority of synchronization error in the HSN comes from RBS synchronization in the PC network. This was principally caused by a small number of synchronization attempts during which prolonged operating system interrupt operations occurred. Although these were difficult to avoid, they did not occur frequently, and the worst-case synchronization error for each type of clock skew compensation technique was sufficient for most kinds of HSN reactive data fusion applications.

IV. TIME SYNCHRONIZATION FOR MULTI-MODAL TRACKING

In this section, we briefly describe our multi-modal target tracking system. We also present the software architecture, which incorporates the time synchronization methods introduced in this paper. Finally, we present experimental results that demonstrate the improvement in tracking performance.

Our multi-modal target tracking system is shown in Figure 6. We employ 5 audio sensors in a 36.5 × 15 ft grid and 3 video sensors deployed on either side of a road. The objective of the system is to detect and track vehicles and people using both the audio and video data. The detection and tracking results are available at the sensor fusion center and can then be made available to other applications.

A. Audio Sensing

Beamforming methods have been successfully applied to detect single or even multiple acoustic sources in noisy and reverberant areas. Beamforming takes advantage of the fact that the distance from the source to each microphone in a beamforming array is different, which means that the signals recorded by each microphone will be phase-shifted replicas of each other. The amount of phase shift at each microphone in the array depends on the microphone arrangement and the location of the source. A typical delay-and-sum beamformer divides the sensing region into beams. For each beam, assuming the source is located in that direction, the microphone signals are delayed according to the phase shift and summed together to get a composite signal. The square-sum of the composite signal is the beam energy. Signal energies are computed for each of the beams, called the beamform, and the maximum-energy beam indicates the direction of the acoustic source.

The data-flow diagram of the beamformer used in our system is shown in Figure 7. The amplified microphone signal is sampled at 1 MHz to provide high resolution for the delay lines required by the closely placed microphones. The raw signal is filtered to remove unwanted noise components and provide a band-limited signal for down-sampling in a later stage. The signal is then fed to a tapped delay line (TDL), which has M different outputs to provide the required delays for each of the M beams. The signal is down-sampled and M beams are formed by adding the four delayed signals together. Blocks are formed from the data streams and an FFT is computed for each block. A frequency selection component is used to select the frequencies of interest. The selected signal is then summed to compute the block power value µi, which is then smoothed by exponential averaging into the beam power λi:

λi(k) = αλi(k − 1) + (1− α)µi.
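The core of the delay-and-sum stage can be sketched in a few lines of NumPy. The delay table would come from the microphone geometry, and the filtering, FFT, and frequency-selection stages of Figure 7 are omitted here, so this is an illustrative reduction rather than the FPGA implementation:

```python
import numpy as np

def beam_powers(signals: np.ndarray, delay_samples: np.ndarray) -> np.ndarray:
    """Delay-and-sum beam energies.

    signals:       (num_mics, n) array of band-limited microphone samples
    delay_samples: (M, num_mics) integer delays derived from the array
                   geometry, one row per beam direction
    Returns the energy of each of the M beams; the argmax indicates the
    direction of the acoustic source.
    """
    M = delay_samples.shape[0]
    powers = np.empty(M)
    for b in range(M):
        # Align the microphone signals under the hypothesis that the source
        # lies in beam b, then sum into a composite signal. np.roll wraps
        # around, which is acceptable when blocks are much longer than the
        # largest delay.
        composite = sum(np.roll(signals[m], int(d))
                        for m, d in enumerate(delay_samples[b]))
        powers[b] = float(np.sum(composite ** 2))
    return powers
```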



Fig. 6. Experimental setup

Fig. 7. Data-flow diagram of the audio beamforming algorithm

As an example, Figure 8 shows the normalized beam power of the audio signal generated by a person speaking a few meters from the sensor, as measured on a street. The direction of the speaker is clearly detected, but it is also visible that the detection is not very sharp (due to the small number of microphones in the array).

Audio sensing is implemented on a sensor node with an array of four microphones, shown in Figure 9. The sensor node is based on a MicaZ mote with an onboard FPGA chip that is used to implement the beamformer [19].

B. Video Sensing

We use a moving object detection algorithm for video sensing. The dataflow in Figure 10 illustrates the algorithm and its components. The first step of moving object detection is background-foreground segmentation of the currently captured frame (It) from the camera. We use the algorithm described in [20] for background-foreground segmentation, which is based on an adaptive background mixture model for real-time tracking. The mixture method models each background pixel by a mixture of K Gaussian distributions.

The foreground image (Ft) from the segmentation process is passed through a median filter, which reduces speckle noise present in the foreground and smooths the foreground image. The foreground then goes through opening and closing morphological operations. Opening removes small features present in the foreground, while the closing operation fills small holes. Finally, the enhanced foreground image undergoes image segmentation to find connected components. Each of the connected components is enclosed by a rectangular bounding box, which represents a moving object (Ot).

We have implemented the moving object detection algorithm using Intel's OpenCV library [21]. In our application, OpenBrick-E Linux servers run the video-based detection at 8 frames per second on video with 320×240 pixel resolution.
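The pipeline of Figure 10 maps closely onto standard OpenCV calls. The sketch below uses OpenCV's MOG2 background subtractor as a stand-in for the mixture model of [20], and the camera index, kernel size, and blur aperture are illustrative choices, not parameters from our deployment:

```python
import cv2

cap = cv2.VideoCapture(0)                      # camera index is illustrative
bg = cv2.createBackgroundSubtractorMOG2()      # stand-in for the model of [20]
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))

while True:
    ok, frame = cap.read()                     # current frame I_t
    if not ok:
        break
    fg = bg.apply(frame)                       # foreground F_t
    fg = cv2.medianBlur(fg, 5)                 # reduce speckle noise
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, kernel)   # drop small features
    fg = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, kernel)  # fill small holes
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]     # moving objects O_t
```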

C. Sensor Fusion

The objective of our sensor fusion algorithm is to combine the detection values from both audio and video sensors. The acoustic sensors provide beam power values λj, j = 0, 1, ..., M, with λj corresponding to direction φj = j·2π/M, where M is the resolution of the sensor.


Fig. 8. The normalized power of an audio signal

Fig. 9. Audio sensor node (10 cm × 6 cm)

The video sensors provide detection angle ranges [αl, βl] for each detected object l, l = 0, 1, ..., L, where L is the number of objects perceived by the given sensor.

In order to perform sensor fusion, we define a detection function Ξ(φ) that formalizes object detection for both types of sensors. For the acoustic sensors,

Ξ(φ) = λj, where j is such that |φj − φ| mod 2π < π/M,

while for the video sensors,

Ξ(φ) = 1 if αl < φ < βl for any l = 0, 1, ..., L, and Ξ(φ) = 0 otherwise.

Using the detection functions and the known locations and orientations of the sensors, the proposed algorithm computes a consistency function defined over the search space (a 2-dimensional plane in our case).


Fig. 10. Data-flow diagram of real-time moving object detection

Let K be the number of sensors placed on the plane at positions (xk, yk), k = 1, ..., K. The search space is divided into N rectangular regions with center points (Xi, Yi) and side lengths ∆xi and ∆yi, i = 1, ..., N, as illustrated in Figure 11.


Fig. 11. Multi-modal sensor fusion. The sensing region is divided into rectangular regions and the consistency function is calculated for each region using acoustic and video data.

The consistency function for each region i is defined as

Ci = Σ_{k=1..K} w(k,i) Ξ(k,i),

where w(k,i) is a weighting factor discussed later and Ξ(k,i) is the contribution of sensor k in the i-th rectangular region. The value of the consistency function for a region is a measure of the likelihood that an object is present in that region. Therefore, our tracking algorithm estimates the position of the object (or multiple objects) by computing the peaks of the consistency function and returning the corresponding coordinates.
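Putting the pieces together, the consistency function can be evaluated with a straightforward double loop over regions and sensors. In this sketch the per-sensor detection functions Ξ and weights w are passed in as callables, since their exact forms are given above; all names are illustrative:

```python
import math

def consistency_map(sensors, centers):
    """Evaluate C_i = sum_k w(k, i) * Xi(k, i) over the region grid.

    sensors: list of (x_k, y_k, xi_fn, w_fn) tuples, where xi_fn(phi) is the
             sensor's detection function (beam power for audio, 0/1 for
             video) at bearing phi, and w_fn(i) is its weighting factor
    centers: list of region center points (X_i, Y_i)
    Returns one consistency value per region; peaks mark likely targets.
    """
    C = []
    for i, (X, Y) in enumerate(centers):
        total = 0.0
        for (xk, yk, xi_fn, w_fn) in sensors:
            # Bearing of region i as seen from sensor k, in [0, 2*pi)
            phi = math.atan2(Y - yk, X - xk) % (2 * math.pi)
            total += w_fn(i) * xi_fn(phi)
        C.append(total)
    return C
```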

D. System Architecture

Figure 12 shows the audio-visual sensor fusion system architecture. The audio sensors periodically send their timestamped detection functions to the base station. The base station for each of the audio nodes and the video sensor nodes send their timestamped detection functions to the sensor fusion node through a time-synchronization daemon. The daemon uses the timescale transformation approach presented in the previous section to convert the timestamps to the local clock, and stores the values of the detection function in appropriate buffers, one for each sensor in the system.

A sensor fusion scheduler triggers periodically and generates a timestamp. The trigger is used to retrieve, from each sensor buffer, the detection function values whose timestamps are closest to the generated timestamp.


Fig. 12. Multi-modal tracking system architecture

The retrieved detection function values are then used to compute the consistency function for the sensing region. The maxima of the consistency function indicate the presence of a target. Note that the scheduler in this architecture decouples the tracking rate, at which the consistency function is computed, from the audio/video sensing rates.
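The buffer lookup driven by the scheduler reduces to a nearest-timestamp search per sensor. A minimal sketch (class and method names are ours):

```python
import bisect

class DetectionBuffer:
    """Per-sensor buffer of (timestamp, detection function) pairs, stored
    after the daemon has converted timestamps to the local timescale."""
    def __init__(self):
        self._ts, self._vals = [], []

    def insert(self, t: float, xi) -> None:
        i = bisect.bisect(self._ts, t)      # keep the buffer time-ordered
        self._ts.insert(i, t)
        self._vals.insert(i, xi)

    def closest(self, t: float):
        """Return the detection function whose timestamp is nearest to t."""
        i = bisect.bisect(self._ts, t)
        j = min((k for k in (i - 1, i) if 0 <= k < len(self._ts)),
                key=lambda k: abs(self._ts[k] - t))
        return self._vals[j]
```

On each trigger, the scheduler stamps the current local time and calls closest() on every sensor's buffer before computing the consistency function.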

E. Experimental Results

We have performed a series of experiments tracking people and/or vehicles, and we present a few representative results. We show the consistency function in the sensing region (dark regions indicate higher values) for three scenarios. For each scenario, we plot the consistency function obtained only from the audio sensors and from both audio and video sensors to illustrate the advantages of multi-modal tracking. Figure 13 shows the results for an experiment where a speaker was inside the network. Figure 14 shows results for an experiment with a speaker outside the network. Figure 15 shows the results for multiple speakers. The results demonstrate that using the video sensors results in a drastic decrease in the variance of the estimated position (the regions with high consistency values are much smaller).

V. CONCLUSION

We have developed a methodology for accurate time synchronization in heterogeneous sensor networks. The synchronization methodology has been implemented as a network service and used in a multi-modal tracking application for performing fusion of audio and video data collected from heterogeneous sensor nodes. In our future work, we intend to explore more advanced tracking algorithms that use recursive and/or distributed methods, as well as consider mobile sensor nodes in the network.

Acknowledgements: This work was supported in part by ARO MURI grant W911NF-06-1-0076, NSF CAREER award CNS-0347440, and a Vanderbilt University Discovery Grant.

REFERENCES

[1] G. Pingali, G. Tunali, and I. Carlbom, "Audio-visual tracking for natural interactivity," in ACM Multimedia, 1999.

[2] M. Yarvis, N. Kushalnagar, H. Singh, A. Rangarajan, Y. Liu, and S. Singh, "Exploiting heterogeneity in sensor networks," in IEEE INFOCOM, 2005.

[3] E. Duarte-Melo and M. Liu, "Analysis of energy consumption and lifetime of heterogeneous wireless sensor networks," in IEEE Globecom, 2002.

[4] L. Lazos, R. Poovendran, and J. A. Ritcey, "Probabilistic detection of mobile targets in heterogeneous sensor networks," in IPSN, 2007.

[5] K. Romer, "Time synchronization in ad hoc networks," in ACM Symposium on Mobile Ad-Hoc Networking and Computing, 2001.

[6] J. Elson, L. Girod, and D. Estrin, "Fine-grained network time synchronization using reference broadcasts," in Fifth Symposium on Operating Systems Design and Implementation (OSDI), 2002.

[7] S. Ganeriwal, R. Kumar, and M. B. Srivastava, "Timing-sync protocol for sensor networks," in ACM SenSys, 2003.

[8] M. Maroti, B. Kusy, G. Simon, and A. Ledeczi, "The flooding time synchronization protocol," in ACM SenSys, 2004.

[9] J. Sallai, B. Kusy, A. Ledeczi, and P. Dutta, "On the scalability of routing integrated time synchronization," in Third European Workshop on Wireless Sensor Networks (EWSN), 2006.

[10] B. Kusy, P. Dutta, P. Levis, M. Maroti, A. Ledeczi, and D. Culler, "Elapsed time on arrival: A simple and versatile primitive for time synchronization services," International Journal of Ad Hoc and Ubiquitous Computing, vol. 2, no. 1, January 2006.

[11] L. Girod, V. Bychkovsky, J. Elson, and D. Estrin, "Locating tiny sensors in time and space: A case study," in ICCD, 2002.

[12] B. H. Yoshimi and G. S. Pingali, "A multimodal speaker detection and tracking system for teleconferencing," in ACM Multimedia, 2002.

[13] D. N. Zotkin, R. Duraiswami, H. Nanda, and L. S. Davis, "Multimodal tracking for smart videoconferencing," in IEEE International Conference on Multimedia and Expo (ICME), 2001.

[14] J. Elson and K. Romer, "Wireless sensor networks: A new regime for time synchronization," in Proceedings of the First Workshop on Hot Topics in Networks, 2002.

[15] K. Romer, P. Blum, and L. Meier, "Time synchronization and calibration in wireless sensor networks," in Wireless Sensor Networks, I. Stojmenovic, Ed. Wiley and Sons, 2005.

[16] D. L. Mills, "Internet time synchronization: The network time protocol," IEEE Transactions on Communications, vol. 39, no. 10, 1991.

[17] ——, "Modelling and analysis of computer network clocks," Electrical Engineering Department, University of Delaware, Tech. Rep. 92-5-2, 1992.

[18] P. Levis et al., "The emergence of networking abstractions and techniques in TinyOS," in NSDI, 2004.

[19] P. Volgyesi et al., "Shooter localization and weapon classification with soldier-wearable networked sensors," in MobiSys, 2007.

[20] P. KaewTraKulPong and R. B. Jeremy, "An improved adaptive background mixture model for realtime tracking with shadow detection," in 2nd European Workshop on Advanced Video Based Surveillance Systems (AVBS), 2001.

[21] Open Source Computer Vision Library. Intel. [Online]. Available: http://www.intel.com/technology/computing/opencv/


Fig. 13. Consistency function for a speaker inside the network, shown without and with the video sensors

Fig. 14. Consistency function for a speaker outside the network, shown without and with the video sensors

Fig. 15. Consistency function for multiple speakers, shown without and with the video sensors