762 ieee transactions on parallel and …ieeeprojectsmadurai.com/base/ns2/secure continuou… ·...

Secure Continuous Aggregationin Wireless Sensor Networks

Lei Yu, Member, IEEE, Jianzhong Li, Member, IEEE, Siyao Cheng,

Shuguang Xiong, and Haiying Shen, Member, IEEE

Abstract—Continuous aggregation is usually required in many sensor applications to obtain the temporal variation information of

aggregates. However, in a hostile environment, the adversary could fabricate false temporal variation patterns of the aggregates by

manipulating a series of aggregation results through compromised nodes. Existing secure aggregation schemes conduct one

individual verification for each aggregation result, which could incur great accumulative communication cost and negative impact on

transmission scheduling for continuous aggregation. In this paper, we identify distinct design issues for protecting continuous in-

network aggregation and propose a novel scheme to detect false temporal variation patterns. Compared with the existing schemes,

our scheme greatly reduces the verification cost by checking only a small part of aggregation results to verify the correctness of the

temporal variation patterns in a time window. A sampling-based approach is used to check the aggregation results, which enables our

scheme independent of any particular in-network aggregation protocols as opposed to existing schemes. We also propose a series of

security mechanisms to protect the sampling process. Both theoretical analysis and simulations show the effectiveness and efficiency

of our scheme.

Index Terms—Wireless sensor network, network security, secure continuous aggregation, sampling

Ç

1 INTRODUCTION

IN applications of wireless sensor networks (WSNs), theaggregations of sensed data, such as sum, average, and

predicate count, are very important for the users to getsummarization information about the monitored area.Instead of collecting all sensor data [1], [2], [3] andcomputing aggregation results at the base station (BS), in-network aggregation allows sensor readings to be aggre-gated by intermediate nodes, which efficiently reduces thecommunication overhead. Many in-network aggregationschemes have been proposed [4], [5], [6], [7]. However,since WSNs are often deployed in an open and unattendedenvironment, an adversary could undetectably take controlof one or more sensor nodes and subvert correct in-networkaggregations by manipulating the partial aggregationresults or reporting arbitrary readings through compro-mised nodes.

In this paper, we consider the security of continuous in-network aggregation in WSNs. In many WSN applicationsfor environment monitoring, the users often need thetemporal variation information in a series of aggregationresults rather than an individual aggregation result. Thus,

continuous aggregation of sensed data is usually desired.For a continuous aggregation query, a time interval, calledepoch, is specified and the aggregation is evaluated in everyepoch. The duration of every epoch specifies the amount oftime sensor nodes wait before acquiring and transmittingeach successive sample. Continuous aggregation is notmerely for one-shot responses to sporadic queries. It helpsthe users to understand how the environment changes overtime and track real-time measurements for trend analysis.

Because of the importance of temporal variation informa-tion of aggregation results, we focus on the attack againstcontinuous in-network aggregation that the adversariesattempt to distort the real temporal variation pattern of theaggregate by disrupting a series of successive aggregationresults. Fig. 1 shows an example. The user is interested in aspecial variation pattern of average temperature shown inthe shadowed box and could make critical decisions whenthe pattern is observed. The adversary can modify aggrega-tion results in the time window to fabricate the variationpattern that actually does not appear, which can lead towrong decisions.

A number of secure aggregation schemes have beenproposed [8], [9], [10], [11]. SIA [8] addresses secureaggregation within the single aggregator network topology.A number of hierarchical secure aggregation schemes [9],[10], [11] are proposed for aggregation in tree networktopology in which each node computes an intermediateaggregation result accounting for all sensing data of nodesin the subtree rooted at it. All these schemes aim to protect asingle aggregation computation. Directly using theseschemes in a continuous aggregation results in individualverification for every aggregation result in every epoch,which will incur a great communication cost especiallyfor continuous aggregation having a long period or high

762 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 25, NO. 3, MARCH 2014

. L. Yu and H. Shen are with the Department of Electrical and ComputerEngineering, Clemson University, 313-B Riggs Hall, Clemson, SC 29634.E-mail: {leiy, shenh}@clemson.edu.

. J. Li and S. Cheng are with the Department of Computer Science andTechnology, Harbin Institute of Technology, PO Box 750#, Harbin150001, Heilongjiang, China. E-mail: [email protected], [email protected].

. S. Xiong is with Baidu Inc, Beijing, China, and the Department ofComputer Science and Technology, Harbin Institute of Technology, PO Box750#, Harbin 150001, Heilongjiang, China. E-mail: [email protected].

Manuscript received 3 Sept. 2012; revised 2 Jan. 2013; accepted 12 Feb. 2013;published online 27 Feb. 2013.Recommended for acceptance by K. Li.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TPDS-2012-09-0790.Digital Object Identifier no. 10.1109/TPDS.2013.63.

1045-9219/14/$31.00 � 2014 IEEE Published by the IEEE Computer Society

frequency (i.e., small epoch). The additional communicationcaused by interactive procedures between the base stationand sensor nodes for verification in every epoch also has anegative impact on the efficiency of transmission schedul-ing for a continuous data aggregation [12]. Besides, theseschemes [8], [9], [10] also are tightly coupled with the treetopology and, thus, unable to work with various other in-network aggregation protocols [6], [7].

In this paper, we present an efficient scheme to detect falsetemporal variation patterns in a continuous aggregation. Ourscheme verifies the correctness of the observed temporalvariation pattern in a time window by checking only a smallpart of aggregation results termed representative points. Therepresentative points are selected to capture the temporalvariation pattern of the aggregate, as shown in Fig. 1.Compared with the existing secure aggregation schemes, ourscheme can considerably reduce the communication costthrough selective verifications of aggregation results.

In our scheme, the correctness of representative points ischecked by hypothesis testing techniques with samplesfrom the WSN. While providing nice security properties,the sampling-based approach only requires a part of nodesto be involved in the verification, and enables verificationnot to rely on any particular in-network aggregationprotocol. To protect the sampling procedure, verifiablerandom sampling is proposed to protect the legitimacy ofsampled nodes, and local authentication based on spatialcorrelation among sensor readings is proposed to protectthe validity of sample readings. As a result, our scheme caneffectively verify the temporal variation patterns forcontinuous aggregation, while being able to achieve lowadditional energy cost and work with various in-networkaggregation protocols [6], [7]. We evaluate our schemebased on extensive experiments using a real trace of sensorreadings. The experiment results show the efficiency andeffectiveness of our scheme.

The rest of the paper is organized as follows: We presentsystem models and design goals of our scheme in Section 3.We propose the details of our scheme in Section 4. Weevaluate the performance and security of our scheme inSection 5. We present simulation results in Section 6.Finally, we conclude this paper in Section 7.

2 RELATED WORK

Due to the importance of aggregation computation forWSN, secure aggregation has received great attention inrecent years. A lot of secure aggregation schemes have beenproposed [8], [9], [10], [11], [13], [14], [15], [16], [17].

Wagner [13] evaluates the resilience of several aggregationfunctions against malicious nodes’ contribution to the finalcomputation results, and proposes to improve the resilienceby truncation and trimming on the set of sensor readings aswell as using robust estimators to compute aggregation.Under one single-aggregator network model, Przydatek et al.[8] propose an aggregate-commit-prove framework SIA todetect false aggregation results. In SIA, the base stationgenerates a commitment to the collection of sensor readingsby Merkle hash tree, and the home server verifies the resultsthrough reliable random sampling achieved by data commit-ment and interactive proofs with the base station. Yu [15]propose to directly use sampling to compute approximateaggregation results with provable guarantees that can alwayscorrectly answer aggregation queries.

A number of hierarchical secure schemes [9], [10], [11],[14], [16] have been proposed for in-network aggregation ontree topology, where each node computes an intermediateaggregation result accounting for the sensor readings ofnodes in the subtree rooted at it. Hu and Evans [14] propose asecure aggregation scheme against one single maliciousnode in the network, in which each node checks theinconsistency of MACs from their children and grand-children. Garofalakis et al. [16] propose to combine crypto-graphic signatures and Flajolet-Martin sketch [18] to achieveverifiable count aggregation.

Several secure hierarchical aggregation schemes [9], [10],[11] follow an aggregation-commitment-attest framework.During the in-network aggregation, each node computesthe hash as commitment over the input of its aggregationcomputation, intermediate results, and data commitmentsfrom its children, and then sends the hash to its parent.Based on the commitments, interactive attest is performedbetween the base station and sensor nodes when aggrega-tion completes. Yang et al. [9] propose a secure hop-by-hopdata aggregation protocol SDAP. The tree topology ispartitioned into multiple logical subtree groups, and sensordata are aggregated in every subtree separately to reducethe trust on high-level nodes. The groups returning outlierresults are attested by checking the aggregation correctnessalong a random path. Chan et al. [10] propose a provablysecure hierarchical aggregation scheme SHIA. In the attestphase of SHIA, the final commitment at the base station isbroadcasted and each node checks that its own contributionwas added into the aggregation by recomputing the finalcommitment with necessary information disseminated fromits ancestor nodes. Frikken et al. introduce modifications ofSHIA which reduce original Oðdmax log2 nÞ communicationper node to Oðdmax lognÞ, where dmax is the maximumdegree of the aggregation tree and n is the number of nodes.Based on SHIA, Roy et al. [17] propose a scheme to verifythe histogram computation to securely estimate the median.

All these previous works address secure in-networkaggregation within a snapshot query, so their approachesconduct verification for each single aggregation result.Unlike them, our work focuses on continuous in-networkaggregation and aims to protect the temporal variationpatterns of aggregation results. To protect continuousaggregation, previous approaches would conduct indivi-dual verification in every epoch and, thus, can incur a

YU ET AL.: SECURE CONTINUOUS AGGREGATION IN WIRELESS SENSOR NETWORKS 763

Fig. 1. The fabrication of the temporal variation pattern in a continuousaggregation.

significant communication cost. In contrast, our approachonly selectively verifies a small part of aggregation resultsin a time window.

3 PROBLEM STATEMENT

3.1 Network and Query Model

We assume a large-scale multihop WSN with a set of sensornodes S ¼ fs1; . . . ; sNg and a trusted base station. The basestation knows the total number of nodes N ¼ jSj. All thenodes and the base station are loosely time synchronizedwith a secure time synchronization service [19], [20]. Eachnode has the same communication radius Rc.

We assume a continuous querying environment forWSNs. For a continuous aggregation query, the base stationinitially disseminates a query into the network, consisting ofthe epoch duration, the period of the aggregation query,and a nonce number nonce. The nonce number is acryptographically secure random number generated bythe base station and used only once to uniquely identifycurrent query and prevent replay of old messages. Theaggregation query period is divided into epochs. In eachepoch, each node calculates a partial aggregate with itscurrent sensor readings, and the base station obtains a finalaggregation result.

Since most physical quantities in practical environments,such as temperature and humidity, usually change con-tinuously, we assume that the duration of an epoch is smallenough to reflect the continuous variation of the measuredphysical quantities with respect to time. As a result, theaggregation results would exhibit continuous variations.Such continuity actually can be characterized as that thedifference between aggregation values at two successiveepochs is bounded, i.e.,

jAðtÞ �Aðtþ 1Þj � �; ð1Þ

where AðtÞ is the true aggregation result in epoch t. Here, �is determined by the characteristics of the observed physicalquantities and the length of epoch duration.

The user is interested in some temporal variation patternof the aggregation results that last for more than one epoch.As opposed to previous works [8], [9], [10], we assume theaggregation is performed over the network without specify-ing any particular in-network structure such as tree [5].

3.2 Security Assumptions

We assume that each sensor node has a unique identifierand shares a unique secret symmetric key with the basestation. By pairwise key establishment schemes [21], [22],each node shares a pairwise key with each of its directneighbors and two-hop neighbors. A broadcast authentica-tion protocol such as �TESLA [23] exists such that any nodecan authenticate a message from the base station. We alsoassume that WSNs have a short safe bootstrapping phaseright after network deployment [21], during which adver-saries cannot successfully compromise any nodes.

3.3 Attack Model and Security Goal

We assume that the adversary can compromise multiplenodes and obtain the security information embedded inthese nodes, but cannot compromise the base station which

is well secured. We assume the Byzantine fault modelwhere a compromised node is under the full control of theadversary and can misbehave in an arbitrary way. Multiplecompromised nodes can collude to attack.

We focus on the attacks against in-network continuousaggregation, which aim to make the base station to accept aseries of false aggregate results of which the temporalvariation pattern deviates from the real one in a noticeablescale. The ways to carry out these attacks include providingfalse sensor readings and manipulating partial results viacompromised nodes. However, no matter which way theattacks use, we can generally model the attacks as

AgðtÞ ¼ AðtÞ þDðtÞ ts � t � te; ð2Þ

where AgðtÞ is the aggregation result received by the basestation in epoch t, and DðtÞ is the deviation of theaggregation result in epoch t caused by the attack. Agð�Þ,Að�Þ, and Dð�Þ are regarded as time-variant functions.½ts; te� is the duration where the temporal variation patternof aggregation results is manipulated. We assume that theattack preserves the continuity in the fabricated aggrega-tion results, since if not it can be easily detected by theusers through checking (1). In other words, we havejAgðtÞ �Agðtþ 1Þj � �.

Our security goal is to protect the authenticity of thetemporal variation pattern observed by the users. Specifi-cally, for a series of aggregation results Ag ¼ ðAgðtÞ; Agðtþ1Þ; . . . ; Agðtþ lÞÞ in a continuous aggregation, we want toguarantee that if the base station accepts Ag, the temporalvariation pattern of Ag is close to the true pattern with ahigh probability.

Notations. We list below notations in this paper:

. u, v, w (in lower case) are sensor nodes.

. N is the total number of sensor nodes.

. Nu is the set of u’s neighbors in u’s communicationrange including itself.

. N2u is the set of u’s two-hop neighbors outside its

communication range.. Rc is the communication radius of sensor nodes.. Ku is u’s individual key shared between u and the BS.. MACðK;mÞ is the message authentication code of

message m generated with a symmetric key K.. ru;t is the sensor reading of u in epoch t.

4 SECURE CONTINUOUS AGGREGATION

4.1 Overview

During the period of a continuous aggregation query, eachsensor node caches lmax number of sensor readings thatcontribute to the aggregations in the latest lmax epochs. lmaxdetermines the maximum length of the time window inwhich the temporal variation pattern of the aggregationresults can be verified.

Once the users observe an interesting temporal variationpattern of the aggregate, they can verify its authenticity on-demand. However, in the circumstance that the adversary isinterested in suppressing the real appearance of aninteresting temporal variation pattern, the users cannotdecide when to conduct verification because they do notknow when the interesting pattern really appears. Thus,periodic verification is required. To this end, the period of


the aggregation query is divided into successive timewindows. Each time window consists of several successiveepochs. At the end of each time window, the temporalvariation pattern in this time window is verified.

Either in the on-demand verification or in the periodicverification, the BS selects some points from the series ofaggregation results in the time window to be verified, andchecks their correctness to detect any fabrication of temporalvariation patterns. Considering that the adversary canmanipulate only a small number of aggregation results suchas extreme points to tamper with the temporal variationpattern, it may be ineffective to check a set of randomlyselected points to detect forged patterns because the selectedpoints may not cover these manipulated points, whichcauses that the attack is not detected. Thus, to guaranteeeffective attack detection, the selected points should be ableto capture the temporal variation pattern in the time windowlike extreme points. We refer to these points as representativepoints and the epoch of a representative point as representa-tive epoch hereinafter. After the selection of representativepoints, the BS broadcasts a verification request, whichincludes the representative epochs, the sampling ratio %,and a nonce number noncev, to the WSN. Once receiving theverification request, each node decides whether to act as asampled node. Before the sampled nodes send to the BS theirsensor readings of every representative epoch, their neigh-boring nodes verify the correctness of sample data andauthenticate the sample messages.

With the sensor reading samples, the BS checks thecorrectness of the aggregation results of each representativeepoch by hypothesis testing. The general form of thehypothesis tests is

H0 : AðtÞ ¼ AgðtÞ versus Ha : AðtÞ 6¼ AgðtÞ: ð3Þ

If the aggregation results in all representative epochs areverified as correct, the temporal variation pattern in thetime window is assumed to be authentic.

In the rest of this paper, we suppose that the timewindow to be verified is from epoch t to epoch tþ l,denoted by ½t; tþ l�, and we always take the points at theboundary epochs t and tþ l as two representative points.Obviously, lþ 1 < lmax.

4.2 Representative Point Selection (RPS)

We first give the definition of representative point toformally characterize the requirement that is to capture thetemporal pattern of the whole aggregation result series.Fig. 2 shows an example.

Definition 1 (representative points). Let P ¼ fðei; AgðeiÞÞ j1 � i � p; e1 ¼ t < e2 < � � � < ep�1 < ep ¼ tþ lg be a set ofpoints in the time window ½t; tþ l�, where AgðeiÞ is theaggregation result in epoch ei. Let FP ð�Þ be the piecewise

linear function consisting of connected line segments, each of

which is between point ðei; AgðeiÞÞ and ðeiþ1; Agðeiþ1ÞÞ for

1 � i � p� 1. If FP is a best approximation of the series

of aggregation results Agð�Þ within ½t; tþ l� among all

possible FP 0 where P 0 ¼ fðe0i; Agðe0iÞÞ j 1 � i � p; e01 ¼ t <e02 < � � � < e0p�1 < e0p ¼ tþ lg, we say P captures the tem-

poral pattern of aggregation results and the points in P are

representative points in the time window ½t; tþ l�. Here, the

goodness of approximation is assessed by the approximation

error between FP ð�Þ and Agð�Þ, which is measured by their

euclidean distance

Eðt; tþ lÞ ¼

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiXtþlk¼tfAgðkÞ � FP ðkÞg2

vuut :

4.2.1 Representative Point Selection

Given Definition 1, the RPS problem can be described as

follows: Given an integer pðp � 2Þ, find a set of points P ¼fðei; AgðeiÞÞ j 1 � i � p; e1 ¼ t < e2 < � � � < ep�1 < ep ¼ tþ lgsuch that the error of approximation of Agð�Þ by FP ð�Þ in

the time window ½t; tþ l� is minimized and jP j ¼ p.Let Fða;bÞðxÞða < bÞ be the linear function through point

ða;AgðaÞÞ and ðb; AgðbÞÞ. Iða; bÞ is the approximation error of

the aggregation results from epoch a to b by Fða;bÞðxÞ. Then,

we have

Fða;bÞðxÞ ¼AgðbÞ �AgðaÞ

b� a ðx� aÞ þAgðaÞ; ð4Þ

Iða; bÞ ¼Xbk¼afAgðkÞ � Fða;bÞðkÞg2: ð5Þ

Assuming t0 > t, we let Eðt; t0; pt0 Þ be the minimum

approximation error when pt0 ðpt0 � 2Þ number of points

between epoch t and t0ð> tÞ (including t and t0) are selected

as representative points for the time window ½t; t0�. Suppose

e01 ¼ t; e02; . . . ; e0pt0 �1; e0pt0¼ t0 are representative epochs, the

points of epochs e01; e02; . . . ; e0pt0 �1 must be an optimal

selection of pt0 � 1 points for approximating Agð�Þ in the

time window ½t; e0pt0 �1�. By the optimal substructure of the

problem, the following recursive formula is given to

compute Eðt; t0; pt0 Þ:

Eðt; t0; pt0 Þ ¼

mintþpt0 �2�k<t0

fEðt; k; pt0 � 1Þ þ Iðk; t0Þg;if 2 < pt0 < t0 � tþ 1;

0; if pt0 � t0 � tþ 1;Iðt; t0Þ; if pt0 ¼ 2:

8>>><>>>: ð6Þ

Based on (6), we propose a dynamic programming

algorithm called RPS algorithm to solve the RPS problem.

The pseudocode of RPS algorithm is shown in Appendix A

(All appendices are in the supplementary file, which can be

found on the Computer Society Digital Library at http://

doi.ieeecomputersociety.org/10.1109/TPDS.2013.63). The

algorithm takes OðlpÞ space and Oðl3pÞ time, considering

the OðlÞ time of the computation of (5).


Fig. 2. Definition of representative points.

4.2.2 RPS with Prespecified Points (RPS-P)

With the knowledge of RPS algorithm and the ability ofpredicting the real temporal variation pattern of theaggregate, the adversary may try to forge a series ofaggregation results of which the selected representativepoints have aggregation values equal or close to the realones. If such attempt is successful, the check of representa-tive points will not detect the fabrication of the temporalvariation. Fig. 3 shows an example of fabricated series ofaggregation results and the representative points selected byRPS algorithm over the fabricated series. The aggregationvalues of representative points are the same as the realaggregation results, which causes that the false patternbetween epoch 0 and 9 cannot be detected. Considering suchpossibility, the randomness is introduced to make the outputof the selection algorithm unpredictable. To this end, eachdata point in ðt; tþ lÞ (not including epoch t and tþ l) isprespecified as a representative point with a probability of qin our scheme. Then, the remaining number of representa-tive points including the ones at two boundary epochs t andtþ l are selected to minimize the approximation error. Onthe other hand, some points such as the maximum andminimum aggregation results, which describe the significantcharacteristics of the temporal variation pattern, should bealways prespecified as representative points.

Therefore, we consider the problem of RPS withprespecified points: given a set of prespecified points eP ð eP �fðt; AgðtÞÞ; ðtþ l; Agðtþ lÞÞgÞ and an integer pðp � j eP jÞ, finda set of representative points P such that the approximationerror of Agð�Þ by FP ð�Þ in ½t; tþ l� is minimized while eP Pand jP j ¼ p. We can see that RPS-P problem becomesRPS when eP ¼ fðt; AgðtÞÞ; ðtþ l; Agðtþ lÞÞg.

L e t eP ¼ fðti; AgðtiÞÞ j 1 � i � epðep � 2Þ; t1 ¼ t < t2 < � � �< tep�1

< tep ¼ tþ lg. Assuming i � 2, we let eEðt1; ti; piÞ bethe minimum approximation error when piðpi � iÞ numberof representative points, which includes prespecified onesfðtj; AgðtjÞÞ j 1 � j � ig, are selected for the approximationin the time window ½t1; ti�. Similarly, the following recursiveformula is given to compute eEðt1; ti; piÞ:eEðt1; ti; piÞ

¼

mini�1�k�ci

f eEðt1; ti�1; kÞ þ Eðti�1; ti; pi � kþ 1Þg;

if i < pi < ti � t1 þ 1; i > 2;

0; if pi � ti � t1 þ 1; i > 2;Xi�1

j¼1

Iðtj; tjþ1Þ; if pi ¼ i > 2;

Eðt1; ti; piÞ; if i ¼ 2;

8>>>>>>>>>><>>>>>>>>>>:

ð7Þ

where ci ¼ minfti�1 � t1 þ 1; pi � 1g. Eðti�1; ti; pi � kþ 1Þand Eðt1; ti; piÞ are computed by (6).

Based on (7), a dynamic programming algorithm is alsoproposed to solve RPS-P problem, referred to as RPS-Palgorithm. The pseudocode is shown in Appendix B,available in the online supplemental material. Consideringthe function call RPS() to RPS algorithm takes Oðl3pÞ timeand OðlpÞ space, RPS-P algorithm takes Oðl3p3epÞ time andOðeppþ lpÞ space.

4.2.3 The Number of Representative Points

Selecting more representative points can further enhance thecapability of our scheme to detect forged temporal variationpattern because a larger number of representative points canbetter capture the variation pattern of aggregation resultsand have a higher probability to cover the manipulatedperiod. However, since each representative point needs tobe verified by collecting sensor reading samples in thecorresponding representative epoch from the WSN, morerepresentative points mean higher communication cost.Therefore, there is a tradeoff between detecting capabilityand communication cost. Sections 4.2.1 and 4.2.2 actuallyaddress the optimal representative point selection to mini-mize the approximation error with a given budget oncommunication cost, i.e., a given number of representativepoints. On the other hand, the users would need to decide atleast how many representative points are required to achievethe desired detecting capability of the scheme. Thus, here weconsider the problem of minimizing the number of repre-sentative points given a certain degree of the approximationerror that the users can tolerate.

The problem can be formally described as follows: Given aset of prespecified points eP and the maximum approximationerror that the users can tolerate, denoted by IE, find a set ofrepresentative points P such that the approximation error ofAgð�Þ by FP ð�Þ in ½t; tþ l� is not greater than IE and jP j isminimized. Based on RPS-P algorithm, the solution for thisproblem is simple. Let p ¼ lþ 1. According to (7), RPS-Palgorithm will generate the result of eEðt; tþ l; kÞ for allep � k � lþ 1. Because it is obvious that eEðt; tþ l; kÞdecreasesask increases, we can simply conduct a linear or binary searchto find the first entry with eEðt; tþ l; kÞ � IE, where k is theminimum number of representative points for the problem.By an auxiliary matrix s½i; j�, we can obtain the correspondingrepresentative points in the same way as RPS-P algorithmexcept pl is initialized with the obtained minimum k in line 22.The computation takes Oðl6epÞ time and Oðeplþ l2Þ space.

4.3 Secure Sampling

After selecting the representative points, the BS checks thecorrectness of each representative point by sampling andhypothesis testing shown in (3). In our scheme, the WSN isuniformly sampled. Sampling provides nice security prop-erties that the integrity of each sample can be authenticatedby a single node with its individual key, and maliciousnodes cannot fabricate or change the reported samples ofhonest nodes. However, the adversary can provide forgedsamples through the compromise nodes to make the BSaccept the null hypothesis in (3). Besides, if without anyprotection, the malicious nodes can easily and alwayspretend to be sampled to provide false samples while not


Fig. 3. An example of attack against RPS algorithm.

being detected. Therefore, we consider the followingproblems in the sampling process: first, how the BS verifiesthe legitimacy of sampled nodes; second, how to detect falsesamples provided by the malicious nodes.

4.3.1 Verifiable Random Sampling

In the verifiable random sampling, each node decideswhether it is sampled by computing a cryptographicallysecure pseudorandom function h. h uniformly maps theinput values into the range of ½0; 1Þ. Specifically, beinginformed of a sampling ratio % broadcasted from the BS,each node, say v, checks the inequality:

hKvðnonce j noncevÞ � %; ð8Þ

where nonce and noncev are nonce numbers that aredisseminated within each aggregation query and verifica-tion request, respectively. If Inequality (8) holds, v sends itssample Rv ¼ ðrv;e1

; rv;e2; . . . ; rv;epÞ, i.e., its sensor readings in

the representative epochs, to the BS. In this way, whether anode is sampled is decided by the node individual key andtwo nonce numbers which are known by the BS. Since eachtime of verification different nonce j noncev is used, thenodes are randomly selected to be sampled for everyverification. Also, a malicious node cannot arbitrarily claimto be sampled, since the BS can verify the legitimacy of v assampled node by checking whether (8) holds.

The actual number of samples returned by this samplingapproach is random. Thus, the determination of samplingratio % needs to provide a probabilistic guarantee to achievethe target sample size of at least mt. Here, Theorem 1 isgiven to decide %. The proofs of all theorems in this paperare given in the supplementary file, available online.

Theorem 1. Given a target sample size of at least mt, toguarantee the final sample size m � mt with a probability of atleast 1� �sð�s < 0:5Þ, the sampling ratio % is at least %�:

%� ¼ 0:5

N � c2 þN2

�Nc2 þ 2ðmt � 0:5ÞN

þ ðN2c4 þ 4N2c2ðmt � 0:5Þ � 4ðmt � 0:5Þ2Nc2Þ0:5�;

where c ¼ ��1ð�sÞ.

4.3.2 Local Sample Authentication

In many applications like environment monitoring, themeasurements from multiple sensors in the same space areoften highly correlated and exhibit a high similarity onstatistical distributions. This fact has been widely exploitedin efficient information extraction [24], routing protocols[25], scheduling algorithms [26], and attack detection [27].We also exploit such fact to propose a local sampleauthentication mechanism to prevent the malicioussampled node from arbitrarily providing false samples.

In the local sample authentication, each sampled node, sayv, broadcasts its sample Rv to its neighbors to obtainauthentication from its neighbors before sending Rv tothe BS. Each neighbor u verifies the validity of Rv. If theverification is successful, u sends to v the message authenti-cation code of Rv computed by u’s key used for theauthentication of v’s sample.

Local authentication key setup. To derive keys for the localsample authentication, each node u is loaded with a seedkey Ks

u before deployment. The BS holds the seed keys of allnodes. During the safe bootstrapping phase, u discovers itsone-hop neighbors and computes a key by Ka

u;v ¼ HðKsujvÞ

for each neighbor v, where H is a secure cryptographic hashfunction. Then, u erases Ks

u. The key Kau;v is stored and used

by u to authenticate samples from v. Since each authentica-tion keys is bound with a neighbor pair, the adversarycannot use it to authenticate samples from arbitrary nodesexcept for the corresponding neighboring node. The erasureof seed keys prevents the adversaries from deriving all theauthentication keys.

Local verification and authentication. Once receiving v’ssample Rv ¼ ðrv;e1

; rv;e2; . . . ; rv;epÞ, each neighbor u collects

necessary information from its neighborhood and verifiesthe validity of Rv by checking the following two conditions:

�u;v > �uT ; ð9aÞ

�v 62 OutlierDetectðf�w j w 2 eNu; eNu NugÞ; ð9bÞ

where �u;v is the Pearson correlation coefficient betweenRu ¼ ðru;e1

; ru;e2; . . . ; ru;epÞ and Rv, �

uT is a threshold deter-

mined by node u or prespecified by the user. �u;v iscomputed by

�u;v ¼ �ðRu;RvÞ ¼Pp

i¼1ðru;ei � �uÞðrv;ei � �vÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPpi¼1ðru;ei � �uÞ

2q ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPp

i¼1ðrv;ei � �vÞ2

q ;

where �xðx ¼ u; v; wÞ is the average of Rx denoting sensorreadings of node x in the representative epochs. Outlier-Detect(S) is a function call to some outlier detection methodto identify and return outliers in set S.

The condition (9a) is used to detect possible abnormalreductions of spatial correlation. Considering that realsensor readings from two proximate nodes are often highlycorrelated and hold similar temporal variation patterns, thefabricated data from a malicious node are very likely tohave a low correlation coefficient with the real data fromneighboring nodes.

Generally, �ðRu;RvÞ depends on the distance dðu; vÞbetween u and v. In Geostatistics, a covariance function �ð�Þis used to model the relation between �ðRu;RvÞ and dðu; vÞ[28]. The covariance function is assumed to be nonnegativeand decrease monotonically with increasing distance d. Withthe knowledge of the covariance model for observedphysical qualities, the users can decide the value of �uT . Forexample, given Power Exponential model �ðdÞ ¼ e�d=�1ð�1 > 0Þ,since u is within the communication range of v, we havedðu; vÞ � Rc < �Rcð1 < � � 2Þ, then �uT can be set as �ð�RcÞaccording to that �ðdÞ decreases monotonically with increas-ing d. The users can determine � according to the spatialcorrelation model of sensor readings in the monitoredarea, with following the rule that � should be large enoughsuch that the condition (9a) is true for real sensor readingswith a high probability, and also should be as small aspossible to efficiently filter false sensing data.

If the users cannot prior estimate and prespecify �uT bythe knowledge of the covariance model before networkdeployment, we need to determine it during operation. We


propose to estimate �uT autonomously by a model-freemethod. The node u randomly selects a subset of nodesfrom its two-hop neighbors N2

u , denoted by eN2u , and collets

their sensor readings in the representative epochs. For eachnode w0 2 eN2

u , u computes the correlation coefficient �u;w0 ,and sets �uT ¼MEDIANðf�u;w0 j w0 2 eN2

ugÞ. The medianvalue is used to replace average to defend against theattack of deflating �uT by malicious samples, since median ismore robust than average [13]. Because u usually has alonger distance to its any two-hop neighbors than to its one-hop neighbors, we have �u;v > �uT .

The condition (9b) exploits the amplitude similarity ofsensor reading series in the neighborhood. Besides sharingthe similar temporal pattern, two series of sensor readingsRu and Rv from the proximate neighborhood do not havetoo much deviation on their amplitude scales �u and �v. Tocheck the condition, u first collects the means of sensorreadings in the representative epochs from a set ofrandomly selected nodes in Nu, denoted by eNu. For asparse network, u may also collect the means of its two-hopneighbors. Then, the outlier detection technique is used tocheck the abnormality of �v, which achieves statisticallyrobustness and effectiveness. Considering that there may bemultiple malicious neighbors to provide forged data, weuse Rosner’s test [29] for outlier detection. It is a general-ization of Grubbs’ test [30] for multiple outliers. To use it, anupper limit Ur must be specified on the number of potentialoutliers present. In our scheme, Ur depends on j eNuj and Urcan be set to at most j eNuj=2. In this paper, a significancelevel � ¼ 0:05 is used for Rosner’s test.

Once u finds both (9a) and (9b) hold for Rv, u computesMACðKa

u;v; RvÞ with its local authentication key Kau;v and

sends MACðKau;v; RvÞ to v. Otherwise, u ignores the

authentication request. A pseudocode of the entire proce-dure of local authentication is shown in Appendix D,available in the online supplemental material, including theprocedure of Rosner’s test.

4.3.3 Sample Message Transmission

After v collects cðc � 1Þ number of MACs from its neighborsfui j 1 � i � cg, v transmits to the BS its sample message:

Sv ¼ fRv; ðv; u1; . . . ; ucÞ; XMACg;

where XMAC ¼ MACðKv;RvÞ � MACðKau1;v; RvÞ . . . �

MACðKauc;v; RvÞ. The XOR of MACs reduces the commu-

nication cost, which has been proved to be secure [31]. Here,c is a security threshold prespecified by the users.

4.4 Aggregation Verification

Once broadcasting the verification request, the BS waits forsome time tw to ensure the arrivals of all samples.Considering the network delivery time of the verificationrequests and sample messages, tw should be at least twiceof the message delivery time from the network boundaryto the BS plus the time for the local sample authentication.According to the procedure of local sample authentication,the time required to complete it consists of the time of one-hop broadcast from a sample node and two-hop broadcastfrom each of its neighbor nodes, and also the time foreach neighbor to collect sensor readings in its two-hopneighborhood and for the sampled node to collect MACs

from its one-hop neighbors. These times can be easilyestimated and accordingly the time for the local sampleauthentication can be estimated. When time expires, the BSfirst checks the validity of every arrived sample and thesample size, and then verifies the aggregation results inrepresentative epochs.

4.4.1 Sample Message Verification

For every sample message, say Sv claimed from node v, theBS verifies its validity in two steps. First, the BS verifies thelegitimacy of the claimed sampled node v by checkingwhether Inequality (8) holds because the BS knows h, nonce,noncev, and Kv. Then, the BS verifies XMAC in the samplemessage. Since the BS holds the seed key Ks

u of any node u,it can generate u’s authentication key Ka

u;v. The BS generatesKaui;v

for each node ui in the ID list ðu1; . . . ; uT Þ in Sv,recomputes XMAC, and compare it with the one in Sv forequality. If the verification in any step above fails, the BSdrops Sv and raises an alarm. Otherwise, the BS accepts Sv.In this way, all invalid sample messages are dropped.

During the local sample authentication, a false samplemay pass the local verification and be successfully authenti-cated by c neighbors due to sufficient number of compro-mised nodes in the same neighborhood. However, it isexpensive for the adversary to provide a large portion offalse samples because of the verifiable random sampling andlocal sample authentication. Thus, we assume the number offalse samples is relatively small to the total sample size andwe can use Rosner’s test to detect outlying sensor readings ineach representative epoch. The sampled nodes from whichoutlying sensor readings are detected are labeled as outlyingnodes and the hypothesis testing is conducted over thesamples excluding those from outlying nodes.

4.4.2 Hypothesis Testing for Aggregation Verification

Let m be the final sample size after the sample messageverification. The set of nodes where the final samples arefrom is denoted by fsi j 1 � i � mg. Here, we discuss theverification for the count, average, and sum queries in arepresentative epoch k, respectively.

Predicate count aggregate. The predicate count query isused to determine the total number of nodes whosesensor readings have some property in the network (e.g.,number of sensors sensing temperature > 30�C). LetAcð�ÞðkÞ and Agcð�ÞðkÞ be the true aggregation result andthe in-network aggregation result, respectively, for count-ing the nodes whose sensor readings in epoch k satisfysome predicate �. The predicate count aggregation isverified by the following hypothesis testing with regardto probability distribution:

H0 : p1 ¼Agcð�ÞðkÞ

N; p2 ¼ 1�

Agcð�ÞðkÞN

;

Ha : p1 6¼Agcð�ÞðkÞ

N; p2 6¼ 1�

Agcð�ÞðkÞN

;

ð10Þ

where p1 is the probability of a sensor node satisfying � andp2 ¼ 1� p1.

According to the hypothesis testing theory, the 2

goodness-of-fit test can be used for (10). m1 be the numberof samples satisfying � in epoch k and m2 ¼ m�m1. Thetest statistic is computed by


X2 ¼m1 �mAgcð�ÞðkÞ

N

� �2

mAgcð�ÞðkÞ

N

þm2 �m 1� Agcð�ÞðkÞ

N

� �� 2

m 1� Agcð�ÞðkÞN

� � : ð11Þ

WhenH0 is true,X2 approximately follows2-distributionwith one degree of freedom. Let 2

�ð1Þ be the upper100� percentage point of 2-distribution with one degreeof freedom. Given a significance level �, if X2 � 2

�ð1Þ or thep-value ofX2 is smaller than�,H0 is rejected. Otherwise,H0 isnot rejected and the BS accepts Agcð�ÞðkÞ as true.

For the predicate count aggregation in epoch k, the use of2 goodness-of-fit test assumes adequate expected numbersin each cell, i.e., mp1 � 10 and mp2 � 10. Then, the hypoth-esis testing would be inefficient in the case that p1 p2 (e.g.,p2 is small) or p1 � p2 (e.g., p1 is small) because a largesample size is required to achieve mp2 � 10 or mp1 � 10,respectively. Suppose p2 ¼ 0:01, the sample size should be1,000 at least. To address this problem, we use an alternativeverification approach that verifies whether the minority istrue, i.e., either checking p2 if p1 p2 or checking p1 ifp1 � p2. This is because that if the result is not true, it mostlikely matters only when the attacks cause the number in acategory to decrease too much to a small degree. Therefore,if p1 � p2ðp1 p2Þ and p1ðp2Þ is not true, we assume thatAcð�ÞðN �Acð�ÞÞ is notably larger than Agcð�ÞðN �Agcð�ÞÞ.

Suppose Mg ¼ minfAgcð�ÞðkÞ; N �Agcð�ÞðkÞg and M isthe true number in the cell corresponding to Mg. Weconduct sampling with ratio % against the nodes within thesmaller cell. Then, the number of received samples mfollows a Binomial distribution BðM;%Þ. Given m, ifm > Mg, then the aggregation result is rejected. Otherwise,we can use one-tail binomial test to verify whether Mg iscorrect. Let X be a random variable following Binomialdistribution BðMg; %Þ. Given a significant level �B,we compute mU ¼ minfk j PrðX � kÞ � �Bg, If m > mU ,the aggregation result is rejected.

Average aggregate. Let AaðkÞ and AgaðkÞ be the trueaggregation result and the in-network aggregation result ofaverage query in epoch k, respectively. According to thecentral limit theorem, the sample mean is approximatelynormally distributed for large sample sizes. Thus, t-test isused to test the hypothesis:

H0 : AaðkÞ ¼ AgaðkÞ versus Ha : AaðkÞ 6¼ AgaðkÞ: ð12Þ

With the sample mean bAaðkÞ ¼ 1m

Pmi¼1 rsi;k and the

sample variance �2 ¼ 1m�1

Pmi¼1ðrsi;k � bAaðkÞÞ2, the test

statistic is computed by

T ¼bAaðkÞ �AgaðkÞ

�=ffiffiffiffiffimp : ð13Þ

When H0 is true, T follows t-distribution with m� 1degrees of freedom. Let t�

2ðm� 1Þ be the upper 100�=2

percentage point of t-distribution with m� 1 degrees offreedom. Given a significance level �, if jT j � t�

2ðm� 1Þ or

the p-value of T is smaller than �, H0 is rejected. Otherwise,the BS accepts AgaðkÞ as true.

Sum aggregate. For the sum query, the sum aggregationresult AgsðkÞ is checked by verifying the average aggrega-tion AgsðkÞ

N .

5 ANALYSIS AND PARAMETER DETERMINATION

In this section, we analyze the effectiveness of our scheme fordetecting false temporal variation patterns of aggregates,and discuss how to determine the parameters includingsampling ratio and probability of prespecifying representa-tive point used in our scheme. The analysis of our scheme’soverhead is given in Appendix F, available in the onlinesupplemental material.

5.1 Effectiveness Analysis

5.1.1 Verification Effectiveness of Representative Point

Our aggregation verification scheme provides a statisticalsecurity guarantee for the aggregation result in eachrepresentative epoch. This is because two kinds of errorscould occur in the hypothesis testing: Type I (false positive) ifH0 is rejected when it is actually true; Type II (false negative) ifH0 is not rejected when it’s actually false. Type I error causesthe BS to reject the true aggregation results and the Type IIerror causes the BS to accept the false aggregation results.

Theorem 2. Let m be the number of samples and � be thesignificance level used for the hypothesis testing, we have:

. If the in-network aggregation result AgðkÞ is true, thenthe BS accepts it with probability at least 1� �.

. Given a constant value �, if jAgðkÞ �AðkÞj > �, then

- With 2 goodness-of-fit test for the predicatecount aggregation, the BS rejects Agcð�ÞðkÞ withprobability at least

1� �1

ca þm

�

N

� � ��

1

c� a þm

�

N

� � � �;

ð14Þ

where

pr ¼Acð�ÞðkÞN

; p1 ¼Agcð�ÞðkÞ

N;

a ¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi2�ð1Þmp1ð1� p1Þ

q; c ¼

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffimprð1� prÞ

p;

and � is the standard normal cumulative distribu-tion function. Here, we assume prð1� prÞ 6¼ 0.

Through verifying the smaller number of twocells with one-tail binomial test and sampling ratio%, the BS rejects Agcð�ÞðkÞ with probability at least

XMgþ�

i¼mUþ1

Mgþ�

i

� �%ið1� %ÞMgþ��i: ð15Þ

- For the average aggregation over the wholenetwork, the BS rejects AgaðkÞ with probabilityat least

1� � t�2ðm� 1Þ þ �

=ffiffiffiffiffimp

� �

� � �t�2ðm� 1Þ þ �

=ffiffiffiffiffimp

� �;

ð16Þ

where 2 is the population variance of sensorreadings in epoch k.


5.1.2 Verification Effectiveness of Aggregation Variation

Pattern

Successful verification of the representative points meansthat the temporal variation pattern observed by the usersshares these common points with the real one. This at leastindicates that the variation pattern given by linear piecewisefunction Fp is embedded in the real variation pattern ofaggregate and provides users the pattern information in acertain extent. The verification of representative pointsforces the adversary to distort the variation pattern bymanipulating a series of aggregation results between tworepresentative points, i.e., in a time window where noepochs are selected as representative epochs. However, wecan show that it is difficult for the adversary to generate suchan undetected series including true representative points.

Except for prespecified representative points, RPS-Palgorithm selects the representative points at the positionswhere the approximation error by piecewise linear functionis minimized. Hence, for the variations of the aggregationresults, our algorithm most likely captures the turningpoints in them, which is inevitable especially when theadversary wants to fabricate some change trends of interestto users. The way for the adversary to avoid turning pointsis to generate linear trends with two true end points, sincethe algorithm tends to select two end points on the line. Butrandomness introduced in the selection of optimal repre-sentative points can help to prevent such attempt becauseevery point in the series is prespecified as representativepoint with probability q.

We assume that the adversary manipulates a series ofaggregation results of length lf , denoted by Agð1Þ;Agð2Þ; . . . ; AgðlfÞ, which deviate from the true values by atleast Dð1Þ; Dð2Þ; . . . ; DðlfÞ, respectively. Each data point ofaggregation results is verified with a probability of q. Ifbeing verified, AgðiÞ is rejected with probability at leastPrrejectðDðiÞÞ, which is given by Theorem 2. Therefore, theprobability of AgðiÞ being detected is at least qPrrejectðDðiÞÞ.Then, since there are lf number of aggregation results, thefabricated series can be detected with probability at least

1�Ylfi¼1

ð1� qPrrejectðDðiÞÞÞ: ð17Þ

We can see it increases with lf , q, and DðiÞ.Suppose that Dmax is the maximum deviation of the

fabricated series from the true values, i.e., Dmax ¼maxtet¼tsDðtÞ. Given successful verification of representativepoints, the undetected fabricated series should return totrue values at the start and end positions, which mean thatthe deviation decreases to zero. According to the continuityassumptions for the real and fabricated aggregation resultsin Section 3, the difference between aggregation values attwo successive epochs is bounded by �, then the maximumdecreasing speed of the difference between the real andfabricated aggregation results is 2�. Then, we have

lf � 2Dmax

2�¼ Dmax

�: ð18Þ

We note that Dmax is infinity norm distance between twoseries and characterizes the difference between their

variation patterns. Formula (18) indicates that the adversaryneeds to fabricate a longer series of aggregation results toachieve larger distortion of variation pattern, which thenincreases the detection probability due to larger lf .

5.2 Parameter Determination

5.2.1 Sampling Ratio

A larger sample size can reduce the probabilities of Type Iand Type II error in the hypothesis testing. Since theprobability of Type I error is limited by the significancelevel �, the determination of the sample size aims atlimiting the probability of Type II error. Formally, given themaximum deviation � which the user can tolerate, theproblem is how to determine the sample size such that ifjAgðkÞ �AðkÞj > � for some k 2 fe1; e2; . . . ; epg, the prob-ability of H0 being rejected is at least 1� �.

For the verification against the smaller number in cellswhen p1 � p2 or p1 p2, the desired sampling ratio %should satisfy that the probability of (15) is at least 1� �. Wecan use Theorem 1 to compute %, by letting mt ¼ mU þ 1,N ¼Mgþ�, and �s ¼ �.

For the verification using 2 goodness-of-fit test, accord-ing to Theorem 2, the desired sample size should satisfy

�1

ca þm

�

N

� � � �

1

c�a þm

�

N

� � � � ð19Þ

for the predicate count aggregation in epoch k.For the average aggregation in epoch k, the desired

sample size should satisfy

� t�2ðm� 1Þ þ �

=ffiffiffiffiffimp

� �� t�

2ðm� 1Þ þ �

=ffiffiffiffiffimp

� �� :

ð20Þ

If c and in (19) and (20) are known, we can use abinary search in ½1; N� to find the least sample size in epochk, denoted by mðkÞ, required to satisfy the inequalities (19)and (20). However, c and are most likely unknown inpractice. To address this problem, the BS could firstbroadcasts the verification request with an initial samplingratio %1 to collect samples for estimating c and in eachrepresentative epoch k. Then, the BS computes mðkÞ. Thenumber of received samples m may be smaller than mðkÞbecause of insufficient initial sampling ratio %1, packet loss,and local verification filtering. In these cases, a newsampling ratio %2ðkÞ is computed for epoch k according toTheorem 1 with mt ¼ mðkÞ. The BS broadcasts f%2ðkÞ j m <mðkÞ; k 2 fe1; e2; . . . ; epgg to the network. Any sensor node,which satisfies Inequality (8) where % ¼ %2ðkÞ, transmits itsreading in epoch k to the BS if the node was not sampled atthe time of the previous round of sampling.

5.2.2 Probability of Prespecifying Representative Point

Assuming that the adversary attempts to fabricate a seriesof aggregation results that deviate from the real ones by atleast � (� is the maximum deviation that the user cantolerate). According to (17), the detection probability is atleast 1� ð1� qPrrejectð�ÞÞlf . The adversary may reduce thedetection probability by decreasing lf , which is the length offabricated aggregation series. However, (17) also indicatesthat we can ensure the detection probability by increasing q,


the probability of a data point being specified as represen-tative point, to offset the impact of decreasing lf . Accord-ingly, we can determine the minimum value of q.

The sampling ratio given in Section 5.2.1 ensuresPrrejectð�Þ � 1� �. Then, we have 1� ð1� qPrrejectð�ÞÞlf �1� ð1� qð1� �ÞÞlf . According to (18), the minimum detec-tion probability that the adversary can achieve is

1� ð1� qð1� �ÞÞ�� : ð21Þ

We choose q to ensure a desired lower bound � for thedetection probability by

1� ð1� qð1� �ÞÞ�� : ð22Þ

Then, we have

q � 1� ð1� �Þ��

1� � : ð23Þ

As we can see, q increases with �, which is the bound ofthe difference between aggregation values at two successiveepochs for characterizing continuity. Larger � enables theadversary to create the same degree of pattern distortionwithin a shorter length series of aggregation results.Therefore, a larger q is required to ensure representativepoint selection from the forged series and, thus, to preservethe detection probability. With prior knowledge aboutmeasured physical quantity, the users can estimate bound� and compute the corresponding probability q by (23).

6 EVALUATION

In this section, we evaluate the performance of local sampleauthentication and aggregation verification by simulations.We use Matlab to perform the simulations. To evaluate ourlocal sample authentication approach, we simulate a WSNbased on a real-world deployment with 54 sensor nodes (IDfrom 1-54) in the Intel Research lab, which includes a traceof sensor readings collected between February and April2004 [32] and node locations. We suppose the communica-tion radius Rc ¼ 10 m. The sensors collected time-stampedhumidity, temperature, and voltage values in 31-secondintervals. Excluding two nodes having incomplete data andone node having abnormal data, we use the first 2,000epochs of the data in the day 03/08 from the remaining51 nodes. We assume a continuous aggregation query onthe temperature attribute during the first 2,000 epochs.During this period the temperature varies between 20 and35. The periodic verification is conducted with a timewindow size lþ 1 ¼ 200, and 10 time windows arenumbered in the order. We note that in the real trace nodeshave missing readings in some epochs and we estimatethese missing data by linear regression in a time window.

6.1 Performance of Local Sample Authentication

In this section, the representative epochs are uniformlychosen from a time window with an interval of 10 epochs.The performance of the local sample authentication isevaluated by the following two metrics:

. Approval rate of real samples. The ratio of the number ofnodes whose data can be successfully authenticated

by at least c neighbors to the total number of nodes inthe benign environment. Even in benign environ-ment, not all samples would be successfully authen-ticated in practice because (9a) and (9b) are twostatistical conditions and there may be not sufficientneighbors. The samples that cannot be authenticatedwill not be accepted by the BS. This metric indicatesthe degree of influence of the local sample authenti-cation on the availability of real samples.

. Disapproval rate of false samples. The ratio of the numberof false samples that cannot be successfully authenti-cated by up to c neighbors to the total number ofcompromised sampled nodes in the hostile environ-ment. It indicates the degree of the prevention of thefalse samples by the local sample authentication.

Fig. 4 illustrates the approval rate of real samples in eachtime window under different security threshold c. As wecan see, the approval rate in each time window decreases asc increases. This is because the number of nodes having upto c neighbors decreases as c increases. When c ¼ 1 andc ¼ 2, the approval rate is higher than 90 and 85 percent,respectively. However, the approval rate is lower than80 percent when c ¼ 3, which is because that the simulatednetwork is sparse (the average degree of the nodes is 5). Itindicates that with a reasonable value of c, here say 2, ourlocal sample authentication approach have a small effect onthe availability of real samples.

To measure the disapproval rate of false samples, weassume thatNc number of nodes are randomly compromisedin the network. We also assume a collusion attack in which acompromised node always provides valid authentication foranother one in the neighborhood. The security thresholdt ¼ 2. The compromised nodes generate false samples inthree manners: first, a false sample is generated by adding aconstant noise e ¼ 100 to each sensor reading in therepresentative epochs as ru;ei ru;ei þ e; second, ru;ei e,where e is drawn from a uniform random distributionUð20; 35Þ; third, ru;ei ru;ei þ e, where e is drawn from anormal distribution Nð20; 100Þ. The first manner preservesthe correlation between the real sample and the other ones inthe neighborhood. The second manner preserves theamplitude similarity. The third manner brings changes bothin the correlation and in the amplitude.

Figs. 5a, 5b, and 5c show the disapproval rate of falsesamples, respectively, generated by the above three man-ners under different Nc. The results are averaged over 50runs. In each run, Nc nodes are randomly selected ascompromised nodes. In each figure, we can see that thedisapproval rate of false samples decreases as Nc increasesin every time window. This is because that more compro-mised nodes would incur a higher probability for that a


Fig. 4. Approval rate of real samples in every time window.

compromised node providing false samples has c compro-mised neighbor nodes to launch the collusion attack. WhenNc ¼ 1 and Nc ¼ 4, the disapproval rate is higher than80 percent in all three figures. Since the network size is small(51 nodes), 10 compromised nodes make up a significantfraction of the network and cause the worst results.

6.2 Performance of Aggregation Verification

To evaluate our aggregation verification scheme, wesimulate a large-scale WSN of 1,000 nodes and the sensingreadings of each node are synthesized by adding randomnoises drawn from Nð0; 0:25Þ to the sensor readings of arandom node from the above real-world deployment. Weconsider continuous average and predicate count aggrega-tion that counts the number of nodes whose temperaturereadings are greater than 25 in the time window [800, 1,000].

We simulate two attacks against the continuous averageaggregation and predicate count aggregation in the timewindow [800, 1,000], respectively. In the attack againstaverage aggregation, the adversary aims to delay the truetime when the temperature rapidly increases. In the attackagainst predicate count aggregation, the adversary aims tofabricate a false fluctuation of predicate count value.

Figs. 6 and 7 show both the real aggregation results andforged ones. Real aggregation results have a rapid increasepattern between (800, 900) for both average and count. Fig. 8shows the population variance of temperature in everyepoch. For the representative point selection, the totalnumber of representative points p ¼ 20. To decide the

probability q of prespecifying random representative points,we investigate the continuity of real aggregation results,and we have that the value difference between twosuccessive epochs falls in ½�0:15; 0:15� with probability99.6 percent for average aggregation, and in ½�20; 20� withprobability 98.9 percent for predicate count aggregation.Then, we let � ¼ 0:15 and � ¼ 20 for two types ofaggregation, respectively. By (23) with � ¼ 0:9 and� ¼ 0:05, we have q � 0:033 for average aggregation with� ¼ 0:5 and q � 0:043 for count aggregation with � ¼ 50.Hence, we prespecify each data point in (800, 1,000) as arepresentative point with q ¼ 0:05.

Figs. 6 and 7 also show the representative points selectedby RPS-P algorithm. We can see these points well capturethe temporal variation in the continuous aggregation. In thefigures, some of points are clustered together, whichindicates it is not necessary to use as many as 20 numberof representative points to capture the patterns. Fig. 9 showsthe approximation errors with different numbers ofrepresentative points. Initially, representative point selec-tion for average aggregation has 12 prespecified pointsincluding randomly selected points, and for count aggrega-tion has 10 prespecified points. As we can see, theapproximation errors are maximum when using onlyprespecified points, and more representative points gainlittle after 13 and 11 numbers, respectively. Only oneadditional point can significantly reduce the approximationerror. This is because of single change point in variation


Fig. 6. Continuous average aggregation under attack.

Fig. 5. Disapproval rate under three manners of forging false samples.

Fig. 8. Population variance in each epoch of [800, 1,000].

Fig. 9. The number of representative points versus approximation error.Fig. 7. Continuous count aggregation under attack.

patterns. Given the maximum tolerant approximation error,we can determine the minimum number of representativepoints by the method in Section 4.2.3.

We use � ¼ 0:02 significant level in the hypothesistesting. In the simulation, we conduct two round samplingsas described in Section 5.2.1. The initial sampling ratio is setto 0.05. The sample variance of first-round collectedsamples is computed. Based on that new sampling ratio iscomputed with � ¼ 0:05 in (19) and (20), and additionalsamples are collected if necessary. Given different max-imum tolerable deviations �, Figs. 10a and 10b show thenew sampling ratios estimated by (19) and (20) for averageand predicate count aggregations in every epoch in [800,1,000], respectively. The figures indicate that less samplesare required for verification with a larger maximumtolerable deviation. The sampling ratio for average aggrega-tion verification is strongly correlated with the variance ofsensor readings, as we can see from Figs. 10a and 8. Incontrast, Fig. 10b shows that the sampling ratio forpredicate count verification has a lower correlation withpopulation variance than for average aggregation becausethe sampling ratio for count aggregation verification isdecided by the proportion of nodes satisfying the predicateand the aggregation result in every epoch, instead ofpopulation variance.

It is worth to note that the sampling ratio between epoch800 and 850 in Fig. 10b, which is estimated by (19), is not validbecause 2 goodness-of-fit test would be failed when thecount aggregation is zero or close to zero. As described inSection 4.4.2, we use binomial test to verify the smallernumber among two cells and compute the sampling ratiosatisfying (15). For Agcð�ÞðkÞ ¼ 0, we have mU ¼ 0 and Mg ¼0 in (15), and % should satisfy

P�i¼1ð�i Þ%ið1� %Þ

��i �1� � ¼ 0:95. Then, we derive % ¼ 0:03 for � ¼ 50, and % ¼0:015 for � ¼ 100. Here, the aggregation results claim nonodes having readings larger than 25. Our scheme actuallyensures at least one node returns its sample with a highprobability if there are at least � nodes having readingsgreater than 25.

We run simulations 100 times in two scenarios ofwithout attacks and under attacks, respectively, shown inFigs. 6 and 7. We find that our scheme achieves 100 percentdetection rate for the attacks against two types of aggrega-tions. Without any attacks, our scheme incurs about 5 and10 percent false alarm rate for predicate count aggregationwith � ¼ 50 and � ¼ 100, respectively. For averageaggregation, our scheme incurs about 4 and 7 percent falsealarm rate with � ¼ 0:5 and � ¼ 0:6, respectively. We cansee that a bigger � causes a higher false alarm rate. This isbecause the estimated sampling ratio decreases when �increases, and less samples cause higher probability of

Type I error. Our results indicate that the false alarm ratecan be decreased by choosing a smaller significant level� ¼ 0:01.

7 CONCLUSION

In this paper, we identify distinct design issues for securecontinuous aggregation in WSNs. An efficient verificationscheme is proposed to protect the authenticity of thetemporal variation patterns in the aggregation results.Compared with the existing secure aggregation schemes,our scheme only need to check a small portion of aggrega-tion results in a time window and, thus, greatly reduces theverification cost. We define representative points andpropose corresponding algorithms for representative pointselection. By exploiting the spatial correlation among thesensor readings in close proximity, a series of securitymechanisms are also proposed to protect the samplingprocedure. Our simulations validate our scheme design.

ACKNOWLEDGMENTS

This research was supported in part by the National GrandFundamental Research 973 Program of China under grant2012CB316200, the Major Program of National NaturalScience Foundation of China under grant 61190115, the KeyProgram of the National Natural Science Foundation ofChina under grant 61033015, and the National NaturalScience Foundation of China under grant 60933001, andin part by US National Science Foundation (NSF) grantsNSF-CSR 1025649, OCI-1064230, CNS-1049947, CNS-1156875, CNS-0917056, and CNS-1057530, CNS-1025652,CNS-0938189, Microsoft Research Faculty Fellowship8300751, and Sandia National Laboratories grant 10002282.A preliminary version of this paper appeared in theproceedings of IEEE INFOCOM 2011.

REFERENCES

[1] Z. Cai, S. Ji, J.S. He, and A.G. Bourgeois, “Optimal DistributedData Collection for Asynchronous Cognitive Radio Networks,”Proc. IEEE 32nd Int’l Conf. Distributed Computing Systems (ICDCS),pp. 245-254, 2012.

[2] S. Ji and Z. Cai, “Distributed Data Collection and Its Capacity inAsynchronous Wireless Sensor Networks,” Proc. IEEE INFOCOM,pp. 2113-2121, Mar. 2012.

[3] S. Ji, R. Beyah, and Z. Cai, “Snapshot/Continuous Data CollectionCapacity for Large-Scale Probabilistic Wireless Sensor Networks,”Proc. IEEE INFOCOM, pp. 1035-1043, Mar. 2012.

[4] C. Intanagonwiwat, R. Govindan, and D. Estrin, “DirectedDiffusion: A Scalable and Robust Communication Paradigm forSensor Networks,” Proc. ACM MobiCom, pp. 56-67, 2000.

[5] S. Madden, M.J. Franklin, J. Hellerstein, and W. Hong, “Tag: ATiny Aggregation Service for Ad-Hoc Sensor Networks,” Proc.Fifth Symp. Operating Systems Design and Implementation (OSDI),2002.

[6] K.-W. Fan, S. Liu, and P. Sinha, “On the Potential of Structure-FreeData Aggregation in Sensor Networks,” Proc. IEEE INFOCOM,pp. 1-12, 2006.

[7] A. Manjhi, S. Nath, and P.B. Gibbons, “Tributaries and Deltas:Efficient and Robust Aggregation in Sensor Network Streams,”Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 287-298,2005.

[8] B. Przydatek, D. Song, and A. Perrig, “SIA: Secure InformationAggregation in Sensor Networks,” Proc. ACM First Int’l Conf.Embedded Networked Sensor Systems (SenSys), pp. 255-265, 2003.

[9] Y. Yang, X. Wang, S. Zhu, and G. Cao, “SDAP: A Secure Hop-By-Hop Data Aggregation Protocol for Sensor Networks,” Proc. ACMMobiHoc, pp. 356-367, 2006.


Fig. 10. Sampling ratio for average and count aggregation in epochs of

[800, 1,000].

[10] H. Chan, A. Perrig, and D. Song, “Secure Hierarchical In-NetworkAggregation in Sensor Networks,” Proc. ACM 13th Conf. Computerand Comm. Security (CCS), pp. 278-287, 2006.

[11] K.B. Frikken and J.A. Dougherty IV, “An Efficient Integrity-Preserving Scheme for Hierarchical Sensor Aggregation,” Proc.ACM First Conf. Wireless Network Security (WiSec), pp. 68-76, 2008.

[12] B. Yu, J. Li, and Y. Li, “Distributed Data Aggregation Schedulingin Wireless Sensor Networks,” Proc. IEEE INFOCOM, pp. 2159-2167, 2009.

[13] D. Wagner, “Resilient Aggregation in Sensor Networks,” Proc.Second ACM Workshop Security of Ad Hoc and Sensor Networks,pp. 78-87, 2004.

[14] L. Hu and D. Evans, “Secure Aggregation for Wireless Networks,”Proc. Workshop Security and Assurance in Ad Hoc Networks, p. 384,2003.

[15] H. Yu, “Secure and Highly-Available Aggregation Queries inLarge-Scale Sensor Networks via Set Sampling,” Proc. ACM/IEEEInt’l Conf. Information Processing in Sensor Networks (IPSN), 2009.

[16] M. Garofalakis, J. Hellerstein, and P. Maniatis, “Proof Sketches:Verifiable In-Network Aggregation,” Proc. IEEE 32nd Int’l Conf.Data Eng. (ICDE), pp. 996-1005, Apr. 2007.

[17] S. Roy, M. Conti, S. Setia, and S. Jajodia, “Securely Computing anApproximate Median in Wireless Sensor Networks,” Proc FourthInt’l Conf. Security and Privacy in Comm. Networks, pp. 6:1-6:10,2008.

[18] P. Flajolet, G.N. Martin, and G.N. Martin, “Probabilistic CountingAlgorithms for Data Base Applications,” J. Computer and SystemSciences, vol. 31, pp. 182-209, 1985.

[19] S. Ganeriwal, S. Capkun, C. chieh Han, and M.B. Srivastava,“Secure Time Synchronization Service for Sensor Networks,” ProcFourth ACM Workshop Wireless Security, pp. 97-106, 2005.

[20] K. Sun and P. Ning, “TinySeRSync: Secure and Resilient TimeSynchronization in Wireless Sensor Networks,” Proc. ACM 13thConf. Computer and Comm. Security (CCS), pp. 264-277, 2006.

[21] S. Zhu, S. Setia, and S. Jajodia, “LEAP: Efficient SecurityMechanisms for Large-Scale Distributed Sensor Networks,” Proc.ACM 10th Conf. Computer and Comm. Security (CCS), pp. 62-72,2003.

[22] D. Liu and P. Ning, “Establishing Pairwise Keys in DistributedSensor Networks,” Proc. ACM 10th Conf. Computer and Comm.Security (CCS), pp. 52-61, 2003.

[23] A. Perrig, R. Szewczyk, V. Wen, D. Culler, and J.D. Tygar, “SPINS:Security Protocols for Sensor Networks,” Wireless Networks, vol. 8,no. 5, pp. 521-534, 2002.

[24] H. Gupta, V. Navda, S.R. Das, and V. Chowdhary, “EfficientGathering of Correlated Data in Sensor Networks,” Proc. ACMMobiHoc, pp. 402-413, 2005.

[25] S. Pattem, B. Krishnamachari, and R. Govindan, “The Impact ofSpatial Correlation on Routing with Compression in WirelessSensor Networks,” Proc. ACM/IEEE Third Int’l Symp. InformationProcessing in Sensor Networks (IPSN), pp. 28-35, 2004.

[26] S. Slijepcevic and M. Potkonjak, “Power Efficient Organization ofWireless Sensor Networks,” Proc. IEEE Int’l Conf. Comm., vol. 2,pp. 472-476, 2001.

[27] F. Liu, X. Cheng, and D. Chen, “Insider Attacker Detection inWireless Sensor Networks,” Proc. IEEE INFOCOM, pp. 1937-1945,May 2007.

[28] M.C. Vuran and I.F. Akyildiz, “Spatial Correlation-Based Colla-borative Medium Access Control in Wireless Sensor Networks,”IEEE/ACM Trans. Networking, vol. 14, no. 2, pp. 316-329, Apr. 2006.

[29] T.H.S.C. Yu, R.C., and J. Froines, “Quality Control of Semi-Continuous Mobility Size-Fractionated Particle Number Concen-tration Data,” Atmospheric Environment, vol. 38, no. 20, pp. 3341-3348, 2004.

[30] F.E. Grubbs, “Procedures for Detecting Outlying Observations inSamples,” Technometrics, vol. 11, no. 1, pp. 1-21, Feb. 1969.

[31] R.G.M. Bellare and P. Rogaway, “XOR MACs: New Methods forMessage Authentication Using Finite Pseudo-Random Functions,”Proc. Advances in Cryptology (Crypto), 1995.

[32] “Intel Lab Data,” http://berkeley.intel-research.net/labdata, 2013.[33] P.S. Mann, Introductory Statistics. John Wiley & Sons, 2006.

Lei Yu received the PhD degree in computerscience from the Harbin Institute of Technol-ogy, China, in 2011. He currently is a post-doctoral research fellow in the Department ofElectrical and Computer Engineering at Clem-son University, SC. His research interestsinclude sensor networks, wireless networks,cloud computing, and network security. He is amember of the IEEE.

Jianzhong Li is a professor in the School ofComputer Science and Technology at HarbinInstitute of Technology, China. In the past, hewas a visiting scholar at the University ofCalifornia at Berkeley, as a staff scientist in theInformation Research Group at the LawrenceBerkeley National Laboratory, and as a visitingprofessor at the University of Minnesota. Hisresearch interests include data managementsystems, sensor networks, and data intensive

computing. He has published more than 150 papers in refereed journalsand conferences, and has been involved in the program committeesof major computer science and technology conferences, includingSIGMOD, VLDB, ICDE, INFOCOM, ICDCS, and WWW. He has alsoserved on the editorial boards for distinguished journals, includingTKDE. He is a member of the IEEE.

Siyao Cheng received the PhD degree incomputer science from the Harbin Institute ofTechnology, China, in 2012. Her researchinterests include data management and wirelesscommunication in sensor networks.

Shuguang Xiong received the PhD degree incomputer science from Harbin Institute ofTechnology, China, in 2011. He is currently asoftware engineer in Baidu Inc., Beijing, China.His research interests include data managementand networking in wireless ad hoc and sensornetworks.

Haiying Shen received the BS degree incomputer science and engineering from TongjiUniversity, China in 2000, and the MS and PhDdegrees in computer engineering from WayneState University in 2004 and 2006, respectively.She is currently an assistant professor in theDepartment of Electrical and Computer Engi-neering at Clemson University, SC. Her researchinterests include distributed computer systemsand computer networks, with an emphasis on

peer-to-peer and content delivery networks, mobile computing, wirelesssensor networks, and grid and cloud computing. She was the programco-chair for a number of international conferences and member of theProgram Committees of many leading conferences. She is a MicrosoftFaculty fellow of 2010 and a member of the IEEE and ACM.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.


762 ieee transactions on parallel and …ieeeprojectsmadurai.com/base/ns2/secure continuou… ·...

Documents