
Multi-modal Target Tracking using Heterogeneous Sensor Networks

Manish Kushwaha, Isaac Amundson, Peter Volgyesi, Parvez Ahammad†, Gyula Simon‡, Xenofon Koutsoukos, Akos Ledeczi, Shankar Sastry†

Institute for Software Integrated Systems, Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, USA

Email: [email protected]

†Department of Electrical Engineering and Computer Science, University of California - Berkeley, Berkeley, CA, USA

‡Department of Computer Science, University of Pannonia, Veszprem, Hungary

Abstract—The paper describes a target tracking system running on a Heterogeneous Sensor Network (HSN) and presents results gathered from a realistic deployment. The system fuses audio direction of arrival data from mote class devices and object detection measurements from embedded PCs equipped with cameras. The acoustic sensor nodes perform beamforming and measure the energy as a function of the angle. The camera nodes detect moving objects and estimate their angle. The sensor detections are sent to a centralized sensor fusion node via a combination of two wireless networks. The novelty of our system is the unique combination of target tracking methods customized for the application at hand and their implementation on an actual HSN platform.

I. INTRODUCTION

Heterogeneous Sensor Networks (HSN) are gaining popularity in diverse fields [23]. They are natural steps in the evolution of Wireless Sensor Networks (WSN) because they can support multiple, not necessarily concurrent, applications that may require diverse resources. Furthermore, as WSNs observe more complex phenomena, multiple sensing modalities become necessary. Different sensors can have different resource requirements in terms of processing, memory, or bandwidth. Instead of using a network of homogeneous devices supporting resource-intensive sensors, an HSN can have different nodes for different sensing tasks.

Target tracking is an application that can benefit from multiple sensing modalities [6]. If the moving object emits some kind of sound, then both audio and video sensors can be utilized. These modalities can complement each other in the presence of high background noise, which impairs the audio, or visual clutter, which affects the video.

In this paper, we describe our ongoing work in target tracking in urban environments utilizing an HSN of mote class devices equipped with acoustic sensor boards and embedded PCs equipped with web cameras. The moving targets to be tracked are vehicles emitting engine noise. Our system has many components including audio processing, video processing, WSN middleware services, and multi-modal sensor fusion based on a sequential Bayesian estimation framework. While some of these components are not necessarily novel, their composition and implementation on an actual HSN requires addressing a number of significant challenges.

We have implemented audio beamforming on mote class devices utilizing an FPGA-based sensor board, and we have evaluated the performance of the audio nodes as well as their energy consumption. While we use a standard video-based object detection algorithm, post-processing of the measurements allows the fusion of audio and video detections. Further, we have extended time synchronization techniques to an HSN consisting of a mote network and a PC network. Finally, the main challenge we addressed is system integration, as well as making the system work on the actual platform in a realistic deployment scenario. The paper provides results gathered in an uncontrolled urban environment and presents a thorough evaluation, including a comparison of different fusion approaches for different combinations of sensors.

The rest of the paper is organized as follows. The next section describes the overall system architecture. It is followed by the description of the audio and then the video processing approach. In Section V we present the time synchronization approach for HSNs. Next, the multimodal tracking algorithm is presented. The experimental evaluation is described in Section VII, followed by a summary of related work. Finally, we discuss the lessons learned and future directions.

II. ARCHITECTURE

Figure 1 shows the system architecture. The audio sensors, consisting of MICAz motes with acoustic sensor boards equipped with a 4-microphone array, form an 802.15.4 network. The video sensors are based on Logitech QuickCam Pro 4000 cameras attached to OpenBrick-E Linux embedded PCs. These video sensors, the mote-PC gateways, the sensor fusion node, and the reference broadcaster for time synchronization are all PCs forming a peer-to-peer 802.11b wireless network.

The audio nodes perform beamforming. The detections are sent to the corresponding mote-PC gateway utilizing a multi-hop message routing service that also performs time translation of the detection timestamps.


Fig. 1. Multimodal tracking system architecture

The video sensors run a motion detection algorithm and compute timestamped detection functions. Both audio and video detections are routed to the central sensor fusion node and stored in appropriate sensor buffers (one for each sensor). In Figure 1, τ denotes measurement timestamps, λ denotes detection functions described in later sections, and t denotes time. The blocks shown inside the sensor fusion node are circular buffers that store timestamped measurements. A sensor fusion scheduler triggers periodically and generates a fusion timestamp. The trigger is used to retrieve the sensor measurement values from the sensor buffers with timestamps closest to the generated fusion timestamp. The retrieved sensor measurement values are then used for multimodal fusion based on sequential Bayesian estimation.

III. AUDIO BEAMFORMING

Beamforming is a space-time operation in which a waveform originating from a given source is received at spatially separated sensors and combined in a time-synchronous manner [4]. In a typical beamforming array, each of the spatially separated microphones receives a phase-shifted source signal. The amount of phase shift at each microphone in the array depends on the microphone arrangement and the location of the source. A typical delay-and-sum beamformer divides the sensing region into directions, or beams. For each beam, assuming the source is located in that direction, the microphone signals are delayed according to the phase shift and summed together into a composite signal. The square-sum of the composite signal is the beam energy. Beam energies are computed for each of the beams, and are collectively called the beamform. The beam with maximum energy indicates the direction of the acoustic source.

Beamforming Algorithm: The data-flow diagram of our beamformer is shown in Figure 2. The amplified microphone signal is sampled at a high sampling frequency (100 kHz) to provide high resolution for the delay lines, which is required by the closely placed microphones. The raw signals are filtered to remove unwanted noise components and provide band-limited signals for downsampling at a later stage. The signal is then fed to a tapped delay line (TDL), which has M different outputs to provide the required delays for each of the M beams. The delays are set by taking into consideration the exact relative positions of the four microphones, so that the resulting beams are steered to angles θ_i = i·360/M degrees, i = 0, 1, ..., M − 1.

Fig. 2. Data-flow diagram of the real-time beamforming sensor

The signal is downsampled and the M beams are formed by adding the four delayed signals together. Data blocks are formed from the data streams (with a typical block length of 5-20 ms) and an FFT is computed for each block. A programmable selector selects the required components from the computed frequency lines to compute the total power of the block. The frequency selection procedure provides flexibility for handling different kinds of sources, each with its particular frequency spectrum. Note, however, that the fixed microphone topology has a significant impact on the performance in different frequency bands. The selector can be reprogrammed at run-time to adapt to the nature of the source. The block power values, µ(θ_i), are smoothed by exponential averaging into the beam energy:

λ_t(θ_i) = α·λ_{t−1}(θ_i) + (1 − α)·µ(θ_i)    (1)

where α is an averaging factor.
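As an illustration of the delay-and-sum computation and the smoothing in equation (1), the following minimal NumPy sketch forms M beams from one buffered block of samples of a small planar array. It is not the FPGA implementation described below; the function name, the far-field delay model, and the use of buffered blocks with circular shifts are assumptions made for clarity.

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed ambient value

def delay_and_sum_block(signals, mic_positions, fs, M=36, alpha=0.9, prev_energy=None):
    # signals: (num_mics, num_samples) synchronously sampled block
    # mic_positions: (num_mics, 2) microphone coordinates in meters
    # fs: sampling frequency in Hz; M: number of beams (360/M degree resolution)
    num_mics, num_samples = signals.shape
    beam_angles = np.deg2rad(np.arange(M) * 360.0 / M)
    block_power = np.zeros(M)
    for i, theta in enumerate(beam_angles):
        # Far-field assumption: each microphone's delay is the projection of its
        # position onto the steering direction, divided by the speed of sound.
        direction = np.array([np.cos(theta), np.sin(theta)])
        delays = mic_positions @ direction / SPEED_OF_SOUND
        shifts = np.round((delays - delays.min()) * fs).astype(int)
        composite = np.zeros(num_samples)
        for m in range(num_mics):
            # Circular shift used for brevity; a real delay line pads instead.
            composite += np.roll(signals[m], shifts[m])
        block_power[i] = np.sum(composite ** 2)  # mu(theta_i), the block power
    if prev_energy is None:
        return block_power
    # Exponential averaging into the beam energy, as in equation (1).
    return alpha * prev_energy + (1.0 - alpha) * block_power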

Audio Hardware: In our application, the audio sensor node is a MICAz mote with an onboard Xilinx XC3S1000 FPGA chip that is used to implement the beamformer [22]. The onboard Flash (4 MB) and PSRAM (8 MB) modules allow storing raw samples of several acoustic events. The board supports four independent analog channels, featuring an electret microphone each, sampled at up to 1 MS/s (million samples per second). A small beamforming array of four microphones arranged in a 10 cm × 6 cm rectangle was placed on the sensor node, as shown in Fig. 3. Since the distances between the microphones are small compared to the possible distances of sources, the sensors perform far-field beamforming. The sources are assumed to be on the same two-dimensional plane as the microphone array, thus it is sufficient to perform planar beamforming by dissecting the angular space into M equal angles, providing a resolution of 360/M degrees. In the experiments, the sensor boards were configured to perform simple delay-and-sum-type beamforming in real time with M = 36, for an angular resolution of 10 degrees. Finer resolution increases the communication requirements.

Messages containing the audio detection functions require 83 bytes, and include node ID, sequence number, timestamps, and 72 bytes for 36 beam energies. These data are transmitted through the network in a single message. The default TinyOS message size of 36 bytes was changed to 96 bytes to accommodate the entire audio message.


Fig. 3. Sensor node showing the microphones (10 cm × 6 cm array)

The current implementation uses less than half of the total resources (logic cells, RAM blocks) of the selected mid-range FPGA device. The application runs at 20 MHz, which is relatively slow in this domain; the inherently parallel processing topology allows this slow clock speed. Nonetheless, the FPGA approach has a significant impact on the power budget: the sensor draws 130 mA of current (at 3.3 V), which is nearly an order of magnitude higher than the current draw of typical wireless sensor nodes.

IV. VIDEO TRACKING

Video tracking systems seek to automatically detect moving objects and track their movements in a complex environment. Due to the inherent richness of the visual medium, video-based tracking typically requires a pre-processing step that focuses the attention of the system on regions of interest in order to reduce the complexity of data processing. This step is similar to the visual attention mechanism in human observers. Since the regions of interest are primarily those containing moving objects, robust motion detection is a key first step in video tracking. A simple approach to motion detection from video data is frame differencing. It compares each incoming frame with a background model and classifies the pixels of significant variation into the foreground. The foreground pixels are then processed for identification and tracking. The success of frame differencing depends on the robust extraction and maintenance of the background model. Performance of such techniques tends to degrade when there is significant camera motion, or when the scene changes significantly.

There exist a number of challenges for the estimation of a robust background model, including gradual illumination changes (e.g., sunlight), sudden illumination changes (e.g., lights switched on), vacillating backgrounds (e.g., swaying trees), shadows, the sleeping person and waking person phenomena, visual clutter, and occlusion [21]. Because there is no single background model that can address all these challenges, the model must be selected based on application requirements.

Algorithm: The data flow in Figure 4 shows the motion detection algorithm and its components used in our tracking application. The first component is background-foreground segmentation of the currently captured frame (I_t) from the camera. We use the algorithm described in [12] for background-foreground segmentation. This algorithm uses an adaptive background mixture model for real-time background and foreground estimation. The mixture method models each background pixel as a mixture of K Gaussian distributions. The algorithm provides multiple tunable parameters for the desired performance. In order to reduce speckle noise and smooth the estimated foreground (F_t), the foreground is passed through a median filter. In our experiments, we used a median filter of size 3 × 3.
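A minimal OpenCV sketch of this per-frame step is shown below. OpenCV's MOG2 background subtractor is used here only as a stand-in for the adaptive mixture model of [12]; the parameter values are illustrative assumptions, not the ones used in our deployment.

import cv2

# MOG2 models each background pixel as a mixture of Gaussians; it serves here
# as a stand-in for the segmentation algorithm of [12].
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=False)

def detect_foreground(frame):
    # Background/foreground segmentation of the captured frame I_t.
    fg = subtractor.apply(frame)      # 0 for background, 255 for foreground
    # 3x3 median filter to reduce speckle noise in the estimated foreground F_t.
    return cv2.medianBlur(fg, 3)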

Fig. 4. Data-flow diagram of the real-time motion detection algorithm (Gaussian background/foreground segmentation, 3 × 3 median filter, detection function)

Since our sensor fusion algorithm (Section VI) utilizes only the angle of moving objects, we use the video sensors as directional devices. We essentially convert the 2D image frame to a 1D detection function that captures the angle information of moving objects. This approach provides us with a 30-fold reduction in communication bandwidth usage; on the downside, we lose a dimension of information that could potentially be used in fusion.

Similar to the beam angle concept in audio beamforming (Section III), the field of view of the camera is divided into M equally spaced angles

θ_i = θ_min + (i − 1)·(θ_max − θ_min)/M,    i = 1, 2, ..., M

where θ_min and θ_max are the minimum and maximum field-of-view angles for the camera. The detection function value for each beam direction is simply the number of foreground pixels in that direction. Formally, the detection function for the video sensors can be defined as

λ(θ_i) = Σ_{j∈θ_i} Σ_{k=1}^{H} F(j, k),    i = 1, 2, ..., M    (2)

where F is the binary foreground image, H and W are the vertical and horizontal resolutions in pixels, and j ∈ θ_i indicates columns in the frame that fall within angle θ_i.
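A direct sketch of equation (2) is given below, assuming that image columns map linearly onto M equally spaced angle bins between θ_min and θ_max; the function and variable names are ours.

import numpy as np

def video_detection_function(F, theta_min, theta_max, M=160):
    # F: binary foreground image of shape (H, W); returns lambda(theta_i), i = 1..M.
    H, W = F.shape
    column_counts = (F > 0).sum(axis=0)              # foreground pixels per column
    # Assign each image column to one of M equally spaced angle bins.
    column_angles = np.linspace(theta_min, theta_max, W, endpoint=False)
    bin_edges = np.linspace(theta_min, theta_max, M + 1)
    bins = np.clip(np.digitize(column_angles, bin_edges) - 1, 0, M - 1)
    lam = np.zeros(M)
    np.add.at(lam, bins, column_counts)              # sum counts per angle bin
    return lam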

Video Post-processing: In our experiments, we gathered video data of vehicles from multiple video sensors in an urban street setting. The data contained a number of real-life artifacts such as vacillating backgrounds, shadows, sunlight reflections, and glint. The algorithm described above was not able to filter out such artifacts from the detections. We implemented two post-processing filters to improve the detection performance. The first filter removes any undesirable persistent background. The second filter removes sharp spikes (typically caused by sunlight reflections and glint); for this, we convolved the detection function with a small linear kernel to add a blurring effect.

We implemented the motion detection algorithm using the OpenCV (open source computer vision) library. Our motion detection implementation runs at 4 frames per second and 320 × 240 pixel resolution. The number of beam angles is M = 160.

V. TIME SYNCHRONIZATION

In order to seamlessly fuse time-dependent audio and video sensor data for tracking moving objects, participating nodes must have a common notion of time. Although several microsecond-accurate synchronization protocols have emerged for wireless sensor networks (e.g., [7], [9], [16], [18]), achieving accurate synchronization in a heterogeneous sensor network is not a trivial task. We employ a hybrid approach, which pairs a specific network with the synchronization protocol that provides the most accuracy with the least amount of overhead.

Mote Network: We used Elapsed Time on Arrival (ETA) [13] to synchronize the mote network. ETA timestamps messages at transmit and receive time, thus removing the largest amount of nondeterministic message delay from the communication pathway. We evaluated synchronization accuracy in the mote network with the pairwise difference method. Two nodes simultaneously timestamped the arrival of an event beacon, then forwarded the timestamps to a sink node two hops away. At each hop, the timestamps are converted to the local timescale. The synchronization error is the difference between the timestamps at the sink node. For 100 synchronizations, the average error was 5.04 µs, with a maximum of 9 µs.

PC Network: We used RBS [7] to synchronize the PC network. RBS synchronizes a set of nodes to the arrival of a reference beacon. Participating nodes timestamp the arrival of a message broadcast over the network, and by exchanging these timestamps, neighboring nodes are able to maintain reference tables for timescale transformation. We used a separate RBS transmitter to broadcast a reference beacon every ten seconds over 100 iterations. Synchronization error, determined using the pairwise difference method, was as low as 17.51 µs on average, and 2050.16 µs maximum. The worst-case error is significantly higher than reported in [7] because the OpenBrick-E wireless network interface controllers in our experimental setup are connected via USB, which has a default polling frequency of 1 kHz.

Mote-PC Network: To synchronize a mote with a PC in software, we adopted the underlying methodology of ETA and applied it to serial communication. On the mote, a timestamp is taken upon transfer of a synchronization byte and inserted into the outgoing message. On the PC, a timestamp is taken immediately after the UART issues the interrupt, and the PC regards the difference between these two timestamps as the PC-mote offset. The serial communication bit rate between the mote and PC is 57600 baud, which amounts to a transfer time of approximately 139 microseconds per byte. However, the UART will not issue an interrupt to the CPU until its 16-byte buffer nears capacity or a timeout occurs. Because the synchronization message is six bytes, reception time in this case will consist of the transfer time of the entire message, in addition to the timeout and the time it takes the CPU to transfer the data from the UART buffer into main memory.

This time is compensated for by the receiver, and the clock offset between the two devices is determined as the difference between the PC receive time and the mote transmit time.
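The offset computation can be summarized by the short sketch below; the byte time and message length follow the figures quoted above, while the timeout term is a placeholder whose value depends on the UART configuration and is an assumption here.

# Sketch of the mote-PC clock offset computation described above (times in microseconds).
BAUD = 57600
BYTE_TIME_US = 8 * 1e6 / BAUD      # ~139 us per byte, as quoted in the text
MESSAGE_BYTES = 6                  # length of the synchronization message
UART_TIMEOUT_US = 0.0              # placeholder; depends on the UART configuration

def mote_pc_offset(mote_tx_us, pc_rx_us):
    # Offset = compensated PC receive time minus the mote transmit time.
    reception_delay = MESSAGE_BYTES * BYTE_TIME_US + UART_TIMEOUT_US
    return (pc_rx_us - reception_delay) - mote_tx_us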

GPIO pins on the mote and PC were connected to an oscilloscope and set high upon timestamping. The resulting output signals were captured and measured. The test was performed over 100 synchronizations, and the resulting error was 7.32 µs on average and did not exceed 10 µs.

HSN: We evaluated synchronization accuracy across the entire network using the pairwise difference method. Two motes timestamped the arrival of an event beacon and forwarded the timestamp to the network sink, via one mote and two PCs. RBS beacons were broadcast at four-second intervals, and therefore clock skew compensation was unnecessary, because synchronization error due to clock skew would be insignificant compared with offset error. The average error over the 3-hop network was 101.52 µs, with a maximum of 1709 µs. The majority of this error is due to the polling delay of the USB wireless network controller. However, synchronization accuracy is still sufficient for our application. The implementation used in these experiments was bundled into a time synchronization service for sensor fusion applications.

VI. MULTIMODAL TARGET TRACKING

This section describes the tracking algorithm and the approach for fusing the audio and video measurements based on a sequential Bayesian estimation framework. We use the following notation: superscript t denotes discrete time (t ∈ Z+), subscript k ∈ {1, ..., K} denotes the sensor index, where K is the total number of sensors in the network, the target state at time t is denoted x^(t), and the sensor measurement at time t is denoted z^(t).

A. Sequential Bayesian Estimation

We use a sequential Bayesian estimation framework to estimate the target location x^(t) at time t, similar to the approach presented in [15]. Sequential Bayesian estimation is a framework to estimate the probability distribution of the target state, p(x^(t+1)|z^(t+1)), using a Bayesian filter described by

p(x^(t+1)|z^(t+1)) ∝ p(z^(t+1)|x^(t+1)) · ∫ p(x^(t+1)|x^(t)) · p(x^(t)|z^(t)) dx^(t)    (3)

where p(x^(t)|z^(t)) is the prior distribution from the previous step, p(z^(t+1)|x^(t+1)) is the likelihood given the target state, and p(x^(t+1)|x^(t)) is the prediction for the target location x^(t+1) given the current location x^(t) according to a target motion model. Since we are tracking moving vehicles, it is reasonable to use a directional motion model based on the vehicle velocity. The directional motion model is described by

x^(t+1) = x^(t) + v + U[−δ, +δ]    (4)

where x^(t) is the target location at time t, x^(t+1) is the predicted location, v is the target velocity, and U[−δ, +δ] is a uniform random variable.


Since the sensor models (described later in Subsection VI-B) are nonlinear, we use a nonparametric representation for the probability distributions, which are represented as discrete grids in 2D space, similar to [15]. For the nonparametric representation, the integration term in equation (3) becomes a convolution operation between the motion kernel and the prior distribution. The resolution of the grid representation is a trade-off between tracking resolution and computational capacity.
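A minimal sketch of one grid-based filter step, combining the prediction of equation (4) with the update of equation (3), is shown below; it assumes SciPy for the convolution and treats the velocity and noise interval in units of grid cells. The function name and defaults are illustrative assumptions.

import numpy as np
from scipy.signal import convolve2d

def sbe_update(prior, likelihood, v_cells=(0, 0), delta_cells=2):
    # prior:      2D grid representing p(x^(t) | z^(t))
    # likelihood: 2D grid representing p(z^(t+1) | x^(t+1)) on the same grid
    # Prediction: shift the prior by the per-step velocity, then convolve with a
    # uniform kernel implementing the U[-delta, +delta] term of equation (4).
    shifted = np.roll(prior, shift=v_cells, axis=(0, 1))
    size = 2 * delta_cells + 1
    kernel = np.full((size, size), 1.0 / size ** 2)
    predicted = convolve2d(shifted, kernel, mode='same', boundary='fill')
    # Update: multiply by the likelihood and renormalize over the grid (equation (3)).
    posterior = likelihood * predicted
    total = posterior.sum()
    return posterior / total if total > 0 else predicted / predicted.sum()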

Centralized Bayesian Estimation: Since we use mote class audio nodes with limited computational resources, centralized Bayesian estimation is a reasonable approach. The likelihood function in equation (3) can be calculated either as a product or as a weighted summation of the individual sensor likelihood functions.

Hybrid Bayesian Estimation: A major challenge in sensor fusion is accounting for conflicting sensor measurements. When sensor conflict is very high, sensor fusion algorithms produce false or meaningless fusion results [11]. Reasons for sensor conflict include sensor locality, different sensor modalities, and sensor faults. If a sensor node is far from a target of interest, then the measurements from that sensor will not be useful. Different sensor modalities observe different physical phenomena; for example, audio and video sensors observe sound sources and moving objects, respectively. If a sound source is stationary or a moving target is silent, the two modalities will be in conflict. Also, different modalities are affected by different types of background noise. Finally, poor calibration or a sudden change in local conditions can also cause conflicting sensor measurements.

Selecting and clustering the sensor nodes into different groups based on locality or modality can mitigate poor performance due to sensor conflict. For example, clustering the nodes close to the target location and fusing only the nodes in the cluster would remove the conflict due to distant nodes.

The sensor network deployment in this paper is small, and the sensing region is comparable to the sensing ranges of the audio and video sensors. For this reason, we do not use locality-based clustering. However, we want to evaluate the tracking performance of the audio and video sensors. Hence, we developed a hybrid Bayesian estimation framework by clustering sensor nodes based on modality and compare it with the centralized approach. Figure 5 illustrates the framework.

Fig. 5. Hybrid sequential estimation (per-sensor detection and likelihood functions, per-cluster fusion via SBE, and posterior fusion yielding the target state)

The likelihood functions from the sensors in a cluster are fused together using a product or weighted sum. The combined likelihood is then used in equation (3) to calculate the posterior distribution for that cluster. The posteriors from all the clusters are then combined to estimate the target state.

For hybrid Bayesian estimation with audio-video clustering, the audio posterior is calculated using

p_audio(x^(t+1)|z^(t+1)) ∝ p_audio(z^(t+1)|x^(t+1)) · ∫ p(x^(t+1)|x^(t)) · p(x^(t)|z^(t)) dx^(t)

while the video posterior is calculated as

p_video(x^(t+1)|z^(t+1)) ∝ p_video(z^(t+1)|x^(t+1)) · ∫ p(x^(t+1)|x^(t)) · p(x^(t)|z^(t)) dx^(t)

The two posteriors are combined either as (product fusion)

p(x^(t+1)|z^(t+1)) ∝ p_audio(x^(t+1)|z^(t+1)) · p_video(x^(t+1)|z^(t+1))

or as (weighted-sum fusion)

p(x^(t+1)|z^(t+1)) ∝ α · p_audio(x^(t+1)|z^(t+1)) + (1 − α) · p_video(x^(t+1)|z^(t+1))

where α is a weighting factor.
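In the grid representation, the combination step reduces to an elementwise operation on the two posterior grids. A hedged sketch follows; the function name is ours, and the final normalization is added for the discrete grid.

import numpy as np

def fuse_posteriors(p_audio, p_video, method='product', alpha=0.5):
    # p_audio, p_video: per-modality posterior grids defined on the same 2D grid.
    if method == 'product':
        fused = p_audio * p_video                             # product fusion
    else:
        fused = alpha * p_audio + (1.0 - alpha) * p_video     # weighted-sum fusion
    return fused / fused.sum()                                # renormalize over the grid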

B. Sensor Models

We use a nonparametric model for the audio sensors and a parametric mixture-of-Gaussians model for the video sensors to mitigate the effect of sensor conflict in object detection.

Audio Sensor Model: The nonparametric DOA sensor model for a single audio sensor is the piecewise linear interpolation of the audio detection function, i.e.,

λ(θ) = w·λ(θ_{i−1}) + (1 − w)·λ(θ_i),    if θ ∈ [θ_{i−1}, θ_i]

where w = (θ_i − θ)/(θ_i − θ_{i−1}).

Video Sensor Model: The video detection algorithm captures the angle of one or more moving objects. The detection function from equation (2) can be parametrized as a mixture of Gaussians

λ(θ) = Σ_{i=1}^{n} a_i f_i(θ)

where n is the number of components, f_i(θ) is the probability density function, and a_i is the mixing proportion for component i. Each component is a Gaussian density function parametrized by µ_i and σ_i^2.
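Both sensor models reduce to evaluating λ(θ) at arbitrary angles. A small sketch is given below, where np.interp provides the piecewise linear interpolation (angles assumed in degrees) and the mixture parameters are assumed to come from a fit to the video detection function; the function names are ours.

import numpy as np

def audio_model(theta, theta_grid, lam_samples):
    # Piecewise linear interpolation of the measured audio detection function,
    # treated as periodic over 360 degrees.
    return np.interp(theta, theta_grid, lam_samples, period=360.0)

def video_model(theta, weights, means, variances):
    # Mixture-of-Gaussians parameterization of the video detection function.
    theta = np.asarray(theta, dtype=float)
    components = [a * np.exp(-(theta - m) ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
                  for a, m, s2 in zip(weights, means, variances)]
    return np.sum(components, axis=0)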

Likelihood Function: Next we present the computation of the likelihood function for a single sensor given the sensor model. The 2D search space is divided into N rectangular regions with center points (x_i, y_i) and side lengths (2δ_x, 2δ_y), i = 1, 2, ..., N, as illustrated in Figure 6.

The angular interval subtended at the sensor node location Q^k due to region i is [ϕ_A^(k,i), ϕ_B^(k,i)]. This angular interval is defined as

ϕ_A^(k,i) = ϕ_0^(k,i) + min_j(∠P_0^i Q^k P_j^i),  ϕ_B^(k,i) = ϕ_0^(k,i) + max_j(∠P_0^i Q^k P_j^i),  j = 1, 2, 3, 4,    if sensor k is not in region i

ϕ_A^(k,i) = 0,  ϕ_B^(k,i) = 2π,    if sensor k is in region i

where ϕ_0^(k,i) = ∠R^k Q^k P_0^i, and the points Q^k and R^k are Q^k = (x^k, y^k) and R^k = (x^k + 1, y^k).

Fig. 6. Computing the likelihood function for a single sensor k and cell (x_i, y_i)

The sensor likelihood function value for sensor k at region i is the average detection function value in that region, i.e.,

p_k(z|x) = p_k(x_i, y_i) = (1 / (ϕ_B^(k,i) − ϕ_A^(k,i))) · Σ_{ϕ_A^(k,i) ≤ θ ≤ ϕ_B^(k,i)} λ_k(θ)
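A sketch of this computation for one sensor and one grid cell is shown below. The function names and the wrap-around handling are ours, and the detection-function angles are assumed to be in radians in the same world frame as the positions.

import numpy as np

def cell_likelihood(cell_center, sensor_pos, theta_grid, lam, delta=(0.25, 0.25)):
    # Average detection-function value over the angular interval [phi_A, phi_B]
    # subtended at the sensor by the rectangular cell, as defined above.
    # theta_grid, lam: NumPy arrays giving the sensor's detection function.
    cx, cy = cell_center
    sx, sy = sensor_pos
    dx, dy = delta
    if abs(cx - sx) <= dx and abs(cy - sy) <= dy:
        phi_a, phi_b = 0.0, 2.0 * np.pi          # sensor k lies inside region i
    else:
        phi0 = np.arctan2(cy - sy, cx - sx)      # angle to the cell center P_0
        corners = [(cx - dx, cy - dy), (cx - dx, cy + dy),
                   (cx + dx, cy - dy), (cx + dx, cy + dy)]
        # Corner angles relative to phi0, wrapped so min/max are well defined
        # across the +/- pi discontinuity.
        rel = [((np.arctan2(py - sy, px - sx) - phi0 + np.pi) % (2.0 * np.pi)) - np.pi
               for px, py in corners]
        phi_a, phi_b = phi0 + min(rel), phi0 + max(rel)
    # Select the grid angles falling inside [phi_A, phi_B] (modulo 2*pi) and average.
    in_interval = ((theta_grid - phi_a) % (2.0 * np.pi)) <= (phi_b - phi_a)
    return lam[in_interval].mean() if in_interval.any() else 0.0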

VII. EVALUATION

The deployment of the multi-modal target tracking system is shown in Figure 7. We deployed six audio sensors and three video sensors on either side of a road. The objective of the system is to detect and track vehicles using both audio and video under these conditions. Sensor localization and calibration for both the audio and video sensors are required. In our experimental setup, we manually placed the sensor nodes at marked locations and orientations. The audio sensors were placed on 1-meter-high tripods to minimize audio clutter near the ground.

We gathered audio and video detection data for a total duration of 43 minutes. Table I presents the parameter values that we used in our tracking system. We ran our sensor fusion and tracking system online using centralized sequential Bayesian estimation based on the product of likelihood functions. We also collected all the audio and video detection data for offline evaluation. This way we were able to experiment with different fusion approaches on the same data set. We shortlisted 10 vehicle tracks where there was only a single target in the sensing region. The average duration of the tracks was 4.25 s, with a 3.0 s minimum and a 5.5 s maximum. The tracked vehicles were part of an uncontrolled experiment; they were traveling on the road at 20-30 mph. The ground truth was estimated post facto based on the video recording of a separate camera. For the evaluation of tracking accuracy, the center of mass of the vehicle is considered to be the true location. Sequential Bayesian estimation requires a prior distribution of the target state. We initialized the prior using a simple detection algorithm based on audio measurements.

TABLE I. PARAMETERS USED IN EXPERIMENTAL SETUP

Number of beams in audio beamforming, M_audio: 36
Number of angles in video detection, M_video: 160
Sensing region (meters): 35 × 20
Cell size (meters): 0.5 × 0.5
Interval for uniform random variable in Equation 4 (δ): 1.2 v

If the maximum of the combined audio detection functions exceeds a threshold and is within the sensing region, we initialize the prior distribution.

We experimented with eight different approaches. We used audio-only, video-only, and audio-video sensor measurements for sensor fusion. For each of these data sets, the combined likelihood was computed either as the weighted sum or the product of the individual sensor likelihood functions. For the audio-video data, we used centralized and hybrid fusion. The different approaches are:

1) audio-only, weighted-sum (AS)
2) video-only, weighted-sum (VS)
3) audio-video, centralized, weighted-sum (AVCS)
4) audio-video, hybrid, weighted-sum (AVHS)
5) audio-only, likelihood product (AP)
6) video-only, likelihood product (VP)
7) audio-video, centralized, likelihood product (AVCP)
8) audio-video, hybrid, likelihood product (AVHP)

Figure 8 shows the tracking error for a representative vehicle track. The tracking error for tracking using audio data is consistently lower than that for the video data. When we use both audio and video data, the tracking error is lower than for either of them considered alone. Figure 9 shows the determinant of the covariance of the target state for the same vehicle track. The covariance, which is an indicator of uncertainty in the target state, is significantly lower for product fusion than for weighted-sum fusion. In general, the covariance for audio-only is higher than for video-only, while using both modalities lowers the uncertainty.

Figure 10 shows the average tracking errors and Figure 11 shows the determinant of the covariance for all ten vehicle tracks for all of the target tracking approaches mentioned above. The audio and video modalities are able to track vehicles successfully, though they suffer from poor performance in the presence of high background noise and clutter. In general, audio sensors are able to track vehicles with good accuracy, but they suffer from high uncertainty and poor sensing range. Video tracking is not very robust to multiple objects and noise. As expected, fusing the two modalities consistently gives better performance. There are some cases where audio tracking performance is better than fusion; this is due to poor performance of video tracking.

Fusion based on the product of likelihood functions gives better performance, but it is more vulnerable to sensor conflict, errors in sensor calibration, and similar effects. The weighted-sum approach is more robust to conflicts and sensor errors, but it suffers from high uncertainty. The centralized estimation framework consistently performed better than the hybrid framework.


Fig. 7. Experimental setup (three video nodes, six audio sensors, and the sensor fusion center deployed on either side of the road)

Fig. 8. Tracking error (a) weighted sum, (b) product

Fig. 9. Tracking variance (a) weighted sum, (b) product

Fig. 10. Tracking errors (a) weighted sum (AS, VS, AVHS, AVCS), (b) product (AP, VP, AVHP, AVCP); average tracking error (m) per vehicle track

VIII. RELATED WORK

Audio Beamforming: An overview of beamforming and its application to localization in sensor networks can be found in [5]. Beamforming methods have been successfully applied to detect single or even multiple sources in noisy and reverberant environments [4], [1], [14].

Video Tracking: Many adaptive background-modeling methods have been proposed. The work in [8] modeled each pixel in a camera scene by an adaptive parametric mixture model of three Gaussian distributions.


Fig. 11. Determinant of covariance (a) weighted sum (AS, VS, AVHS, AVCS), (b) product (AP, VP, AVHP, AVCP), per vehicle track

An adaptive nonparametric Gaussian mixture model addressing background modeling challenges is presented in [19]. Other techniques using high-level processing to assist the background modeling have also been proposed [21], [12].

Time Synchronization: Time synchronization in sensor networks has been studied extensively in the literature, and several protocols have been proposed [7], [9], [16], [17], [13], [18]. Mote-PC synchronization was achieved in [10] by connecting the GPIO ports of a mote and an iPAQ PDA.

Multimodal Tracking: Previous work in multimodal target tracking using audio-video data includes object localization and tracking based on Kalman filtering [20], as well as particle filtering approaches [3], [2].

IX. CONCLUSIONS

We have developed a multimodal tracking system using an HSN consisting of six mote audio nodes and three PC camera nodes. Our system employs a sequential Bayesian estimation framework that integrates audio beamforming with video object detection. Time synchronization across the HSN allows the fusion of the sensor measurements. We have deployed the HSN and evaluated its performance by tracking moving vehicles in an uncontrolled urban environment. Our evaluation in this paper is limited to single targets, and we have shown that, in general, fusion of audio and video measurements improves the tracking performance. Currently, our system is not robust to multiple acoustic sources or multiple moving objects, and this is the main direction of our future work. As in all sensor network applications, scalability is an important aspect that has to be addressed, and we plan to expand our HSN using additional mote class devices equipped with cameras.

Acknowledgement: This work is partially supported by ARO MURI W911NF-06-1-0076.

REFERENCES

[1] S. T. Birchfield. A unifying framework for acoustic localization. In Proceedings of the 12th European Signal Processing Conference (EUSIPCO), Vienna, Austria, September 2004.

[2] V. Cevher, A. Sankaranarayanan, J. H. McClellan, and R. Chellappa. Target tracking using a joint acoustic video system. IEEE Transactions on Multimedia, 9(4):715–727, June 2007.

[3] N. Checka, K. Wilson, V. Rangarajan, and T. Darrell. A probabilistic framework for multi-modal multi-person tracking. In IEEE Workshop on Multi-Object Tracking, 2003.

[4] J. Chen, L. Yip, J. Elson, H. Wang, D. Maniezzo, R. Hudson, K. Yao, and D. Estrin. Coherent acoustic array processing and localization on wireless sensor networks. In Proceedings of the IEEE, volume 91, pages 1154–1162, August 2003.

[5] J. C. Chen, K. Yao, and R. E. Hudson. Acoustic source localization and beamforming: theory and practice. In EURASIP Journal on Applied Signal Processing, pages 359–370, April 2003.

[6] M. Ding, A. Terzis, I.-J. Wang, and D. Lucarelli. Multi-modal calibration of surveillance sensor networks. In Military Communications Conference, MILCOM 2006, 2006.

[7] J. Elson, L. Girod, and D. Estrin. Fine-grained network time synchronization using reference broadcasts. In Operating Systems Design and Implementation (OSDI), 2002.

[8] N. Friedman and S. Russell. Image segmentation in video sequences: A probabilistic approach. In Conference on Uncertainty in Artificial Intelligence, 1997.

[9] S. Ganeriwal, R. Kumar, and M. B. Srivastava. Timing-sync protocol for sensor networks. In ACM SenSys, 2003.

[10] L. Girod, V. Bychkovsky, J. Elson, and D. Estrin. Locating tiny sensors in time and space: A case study. In ICCD, 2002.

[11] H. Y. Hau and R. L. Kashyap. On the robustness of Dempster's rule of combination. In IEEE International Workshop on Tools for Artificial Intelligence, 1989.

[12] P. KaewTraKulPong and R. B. Jeremy. An improved adaptive background mixture model for realtime tracking with shadow detection. In Workshop on Advanced Video Based Surveillance Systems (AVBS), 2001.

[13] B. Kusy, P. Dutta, P. Levis, M. Maroti, A. Ledeczi, and D. Culler. Elapsed time on arrival: A simple and versatile primitive for time synchronization services. International Journal of Ad hoc and Ubiquitous Computing, 2(1), January 2006.

[14] A. Ledeczi, A. Nadas, P. Volgyesi, G. Balogh, B. Kusy, J. Sallai, G. Pap, S. Dora, K. Molnar, M. Maroti, and G. Simon. Countersniper system for urban warfare. ACM Trans. Sensor Networks, 1(2), 2005.

[15] J. Liu, J. Reich, and F. Zhao. Collaborative in-network processing for target tracking. In EURASIP Journal on Applied Signal Processing, 2002.

[16] M. Maroti, B. Kusy, G. Simon, and A. Ledeczi. The flooding time synchronization protocol. In ACM SenSys, 2004.

[17] K. Romer. Time synchronization in ad hoc networks. In ACM Symposium on Mobile Ad-Hoc Networking and Computing, 2001.

[18] J. Sallai, B. Kusy, A. Ledeczi, and P. Dutta. On the scalability of routing integrated time synchronization. In Workshop on Wireless Sensor Networks (EWSN), 2006.

[19] C. Stauffer and W. Grimson. Learning patterns of activity using real-time tracking. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.

[20] N. Strobel, S. Spors, and R. Rabenstein. Joint audio video object localization and tracking. In IEEE Signal Processing Magazine, 2001.

[21] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: Principles and practice of background maintenance. In IEEE International Conference on Computer Vision, 1999.

[22] P. Volgyesi, G. Balogh, A. Nadas, C. Nash, and A. Ledeczi. Shooter localization and weapon classification with soldier-wearable networked sensors. In Mobisys, 2007.

[23] M. Yarvis, N. Kushalnagar, H. Singh, A. Rangarajan, Y. Liu, and S. Singh. Exploiting heterogeneity in sensor networks. In IEEE INFOCOM, 2005.