pirs: query verification on data streams

PIRS: Query Verification on Data Streams Ke Yi, Hong Kong University of Science and Technology Feifei Li, Florida State University Marios Hadjieleftheriou, AT&T Labs George Kollios, Boston University Divesh Srivastava, AT&T Labs work done while the 1 st and 2 nd authors were working at AT&T


Post on 03-Feb-2016


Page 1: PIRS: Query Verification on Data Streams

PIRS: Query Verification on Data Streams

Ke Yi, Hong Kong University of Science and Technology
Feifei Li, Florida State University
Marios Hadjieleftheriou, AT&T Labs
George Kollios, Boston University
Divesh Srivastava, AT&T Labs

Work done while the 1st and 2nd authors were working at AT&T Labs.

Page 2: PIRS: Query Verification on Data Streams

Publishing Data and Outsourcing Query Service

[Figure: an IP traffic stream (0 1 1 0 0 1 … 1 1 0 …) coming from a network is fed to Gigascope, an analysis tool, which computes statistics and returns the results.]

Page 3: PIRS: Query Verification on Data Streams

Revisiting the CISCO – AT&T Example

[Figure: the same pipeline as before: network, IP traffic stream (0 1 1 0 0 1 … 1 1 0 …), Gigascope, statistics.]

Lawyers: sign the trust agreement.

Computer scientists: could we help?

Page 4: PIRS: Query Verification on Data Streams

Concrete Example

Continuous Query:

SELECT SUM(packet_size) FROM IP_trace

GROUP BY srcIP, destIP

IP Stream: p1, p2, p3, . . . , pm (each tuple: srcIP, destIP, packet_size)

Answer (one sum per group, changing over time):

Time   Group 1   Group 2   Group 3   . . .   Group n
5      10KB      2KB       150KB     . . .   5KB
10     11KB      130KB     1MB       . . .   20KB
13     . . .

Page 5: PIRS: Query Verification on Data Streams

Continuous Query Verification (CQV) on Data Streams

1. Client registers the query.
2. Server reports the answer upon request.

The server maintains the exact answer. The client maintains a synopsis X. Both client and server monitor the same stream.

[Figure: a source of streams feeds both the server and the client; the answer consists of one value per group (Group 1, Group 2, Group 3, …).]

SELECT SUM(packet_size) FROM IP_Trace GROUP BY src_ip, dest_ip

Page 6: PIRS: Query Verification on Data Streams

The Model for the Stream

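$\sum_{i=1}^{n} v_i \le m$

Stream S: 9|1, 7|i, 1|1, … (each tuple is agg_attribute | group_id)

$V^T$ = (V1, V2, V3, …, Vi, …, Vn), initially all zeros:

T=1 (tuple 9|1): V1 = 9
T=2 (tuple 7|i): V1 = 9, Vi = 7
T=3 (tuple 1|1): V1 = 10, Vi = 7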
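The stream model above can be sketched in a few lines of Python. This is only an illustration: the function name `apply_stream` and the concrete group id 4 standing in for the slide's symbolic group i are assumptions, not part of the slides.

```python
from collections import defaultdict

def apply_stream(tuples):
    """Return the exact group vector V after applying (value, group_id) tuples."""
    V = defaultdict(int)           # every v_i starts at 0
    for value, group in tuples:
        V[group] += value          # tuple "value|group" adds value to v_group
    return dict(V)

# The slide's stream 9|1, 7|i, 1|1, with i = 4 chosen arbitrarily for the demo.
stream = [(9, 1), (7, 4), (1, 1)]
V = apply_stream(stream)           # V1 ends at 10, V4 at 7
m_used = sum(V.values())           # the model assumes this never exceeds m
```

This is what the server has to maintain exactly; its size grows with the number of groups, which is the cost the client-side synopsis is meant to avoid.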

Page 7: PIRS: Query Verification on Data Streams

Continuous Query Verification: CQV

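[Figure: the stream S = 9|1, 7|i, 1|1, … arrives at T=1, T=2, T=3. The server applies each tuple to the exact vector $V^T$ = (V1, V2, V3, …, Vi, …, Vn) ("Update V"); the client applies it to the synopsis $X^T$ ("Update X"). The verifier A checks the vector reported by the server against $X^T$: a report whose entries differ from $V^T$ triggers an alarm, while a report with V1 = 10 and Vi = 7 triggers no alarm.]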

Page 8: PIRS: Query Verification on Data Streams

PIRS: Polynomial Identity Random Synopsis

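Choose a prime p: $\max\{m/\delta,\, n\} < p \le 2\max\{m/\delta,\, n\}$

Choose a random number a: $a \in Z_p$

$X(V^T) = (a-1)^{v_1} (a-2)^{v_2} \cdots (a-n)^{v_n} \bmod p$

A: $X(\hat{V}^T) \stackrel{?}{=} X(V^T)$; raise an alarm if not equal, otherwise no alarm.

Decomposability: $X(V_a + V_b) = X(V_a) \cdot X(V_b)$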
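A minimal Python sketch of the synopsis on this slide, reading the definition as X(V) = product over groups i of (a − i)^(v_i) mod p, with p a prime just above max(m/δ, n). The helper names (`is_prime`, `choose_prime`, `pirs`), the integer parameter `inv_delta` standing for 1/δ, and the demo values are assumptions made for illustration; trial division is only adequate for small demo primes.

```python
import random

def is_prime(q):
    """Trial division; fine for the small demo values used here."""
    if q < 2:
        return False
    d = 2
    while d * d <= q:
        if q % d == 0:
            return False
        d += 1
    return True

def choose_prime(m, inv_delta, n):
    """Smallest prime p > max(m/delta, n); Bertrand's postulate gives p <= 2*max."""
    p = max(m * inv_delta, n) + 1
    while not is_prime(p):
        p += 1
    return p

def pirs(V, a, p):
    """X(V) = prod_i (a - i)^(v_i) mod p, where V maps group id i -> v_i >= 0."""
    x = 1
    for i, v in V.items():
        x = x * pow(a - i, v, p) % p
    return x

n, m, inv_delta = 8, 20, 100           # demo parameters (assumed)
p = choose_prime(m, inv_delta, n)
a = random.randrange(p)                # the secret random seed in Z_p

Va = {1: 9, 4: 7}
Vb = {1: 1, 2: 3}
Vsum = {1: 10, 2: 3, 4: 7}             # Va + Vb, group by group
alarm_on_truth = pirs(Vsum, a, p) != pirs(dict(Vsum), a, p)
```

The last line mirrors the check A: an honest report always produces the same value, so it never raises an alarm.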

Page 9: PIRS: Query Verification on Data Streams

Incremental Update to PIRS

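Stream S: 9|1, 7|i, 1|1, … (arriving at T=1, T=2, T=3)

T=1, update to v1: $X_1 = (a-1)^9$
T=2, update to vi: $X_2 = X_1 \cdot (a-i)^7$
T=3, update to v1: $X_3 = X_2 \cdot (a-1)^1$

An update to group i with value u can be done in O(log u) time (exponentiation by squaring): $X \leftarrow X \cdot (a-i)^u \bmod p$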
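The incremental update can be sketched as follows. The class name `PIRS`, the fixed demo values of p and a, and the concrete group id 4 standing in for the slide's group i are assumptions for illustration. Python's built-in three-argument `pow` performs the modular exponentiation.

```python
class PIRS:
    def __init__(self, p, a):
        self.p, self.a = p, a
        self.x = 1                       # X of the all-zero vector

    def update(self, group, u):
        """Process tuple u|group: X <- X * (a - group)^u mod p."""
        self.x = self.x * pow(self.a - group, u, self.p) % self.p

p, a = 2003, 1234                        # assumed demo values (p prime, a in Z_p)
syn = PIRS(p, a)
for u, i in [(9, 1), (7, 4), (1, 1)]:    # the stream 9|1, 7|i, 1|1 with i = 4
    syn.update(i, u)

# Same value computed directly from the final vector V1 = 10, V4 = 7.
direct = pow(a - 1, 10, p) * pow(a - 4, 7, p) % p
```

The synopsis never stores the vector itself, only the running product, so each tuple costs one modular exponentiation and one multiplication.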

Page 10: PIRS: Query Verification on Data Streams

It Solves CQV problem!

Theorem: Given any $W^T \ne V^T$, PIRS raises an alarm with probability at least 1-δ.

1. If W = V, it obviously raises no alarm.
2. If W ≠ V, consider the polynomials

$f_v(x) = (x-1)^{v_1} (x-2)^{v_2} \cdots (x-n)^{v_n}, \quad f_w(x) = (x-1)^{w_1} (x-2)^{w_2} \cdots (x-n)^{w_n}$

$f_v(x) \equiv f_w(x)$ iff V = W: a polynomial with 1 as the leading coefficient is completely determined by its zeroes, due to the fundamental theorem of algebra.

If V ≠ W, $f_v(x) = f_w(x)$ happens at no more than m values of x, because $f_v - f_w$ is then a nonzero polynomial of degree at most m.

Since we have p > m/δ choices for a, the probability that X(V) = X(W) is at most δ.
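The counting step of the proof can be checked by brute force on a toy instance. The sketch below uses a deliberately tiny prime (p = 101) and two hand-picked vectors; all concrete values and the helper name `f` are assumptions for illustration. It enumerates every x in Z_p and records where the two polynomials agree.

```python
def f(vec, x, p):
    """Evaluate prod_i (x - i)^(vec_i) mod p."""
    out = 1
    for i, e in vec.items():
        out = out * pow(x - i, e, p) % p
    return out

p = 101                                  # small prime so every x can be tried
V = {1: 3, 2: 1, 5: 2}
W = {1: 3, 2: 2, 5: 1}                   # differs from V in groups 2 and 5
m = max(sum(V.values()), sum(W.values()))

# Values of x (candidate seeds a) on which a wrong W would go undetected.
collisions = [x for x in range(p) if f(V, x, p) == f(W, x, p)]
failure_rate = len(collisions) / p
```

Here f_v − f_w factors as −3(x−1)^3(x−2)(x−5), so the only undetected seeds are x = 1, 2, 5, well within the bound of m = 6.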

Page 11: PIRS: Query Verification on Data Streams

Optimality of PIRS

Theorem: PIRS occupies O(log(m/δ) + log n) bits of space (at most 3 words: p, a, X(V)), spends O(1) time to process a tuple for a count query, or O(log u) time to process a tuple for a sum query.

Theorem: Any synopsis for solving the CQV problem with error probability at most δ has to keep Ω(log(min{n,m}/δ)) bits.

Page 12: PIRS: Query Verification on Data Streams

Multiple Queries

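[Figure: with separate synopses, queries Q1 and Q2 each keep their own synopsis (X1, X2); PIRS instead keeps a single synopsis X for both. The group vectors V1..n1 and V1..n2 are concatenated into one vector V1..(n1+n2), so a stream tuple such as 9|1,8 is an update to v1 and an update to v8.]

Theorem: our synopses use constant space for multiple queries.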
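One way to read the figure is that the group vectors of the two queries are concatenated, so group j of the second query becomes index n1 + j of a single vector. The sketch below follows that reading; the function name `combined_index`, the sizes n1 = 5 and n2 = 6, and the demo values of p and a are assumptions.

```python
def combined_index(query, group, sizes):
    """Map (query number, local group id) to an index in the concatenated vector."""
    offset = sum(sizes[:query - 1])      # total size of all earlier queries
    return offset + group

p, a = 2003, 1234                        # assumed demo values (p prime, a in Z_p)
sizes = [5, 6]                           # n1, n2 (assumed)
x = 1                                    # the single shared synopsis X

# One tuple with value 9 that falls in group 1 of Q1 and group 3 of Q2.
for query, group in [(1, 1), (2, 3)]:
    i = combined_index(query, group, sizes)
    x = x * pow(a - i, 9, p) % p
```

With these sizes, group 3 of Q2 maps to index 8, matching the slide's "update to v1, update to v8". The prime p has to be chosen for the concatenated vector, so n becomes n1 + n2.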

Page 13: PIRS: Query Verification on Data Streams

Handle the Load Shedding

Semantic load shedding: drop tuples from certain groups, so a small number of groups have errors.

Random load shedding: drop tuples at random, so all groups have a small amount of error.

Page 14: PIRS: Query Verification on Data Streams

CQV with Semantic Load Shedding

Randomly drop certain tuples according to their groups.

Stream: 9|1, 7|i, 2|j, 1|1, 4|k, …, 5|1

The server claims that at most γ groups have errors.

Goal: detect whether more than γ groups have errors.

We have designed synopses that use O(γ log(1/δ) log n) bits of space and achieve an error probability of at most δ.

Page 15: PIRS: Query Verification on Data Streams

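PIRSγ: An Exact Solution

Use $k = c_1 \gamma^2$ buckets, for $c_1 = 4.819$.

b: a pairwise-independent hash function that maps {1, …, n} uniformly to {1, …, k}, e.g., b(i) = ((x·i + y) mod p) mod k.

[Figure: one layer consists of k buckets, each holding its own PIRS; group v8 with b(8) = 2 is assigned to bucket 2.]

A layer raises an alarm if at least γ buckets raise alarms.

There are log 1/δ layers; the synopsis raises an alarm if at least one layer raises an alarm.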
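A rough Python sketch of the layered bucket construction on this slide. Several details are assumptions rather than facts from the slides: the bucket count is taken as ceil(c·γ²) with c = 4.819, the hash is b(i) = ((x·i + y) mod p) mod k, the same prime p is reused for the hash and for every bucket's PIRS, and the class and method names are invented for the example.

```python
import math
import random

class PIRSGamma:
    def __init__(self, gamma, p, layers, c=4.819, rng=None):
        rng = rng or random.Random()
        self.gamma, self.p = gamma, p
        self.k = math.ceil(c * gamma * gamma)       # assumed bucket count
        self.layers = [{
            "x": rng.randrange(1, p), "y": rng.randrange(p),   # hash parameters
            "a": [rng.randrange(p) for _ in range(self.k)],    # one seed per bucket
            "X": [1] * self.k,                                  # one PIRS per bucket
        } for _ in range(layers)]

    def _fold(self, layer, acc, i, u):
        """Multiply (a_b - i)^u into bucket b = b(i) of the accumulator."""
        b = ((layer["x"] * i + layer["y"]) % self.p) % self.k
        acc[b] = acc[b] * pow(layer["a"][b] - i, u, self.p) % self.p

    def update(self, i, u):
        for layer in self.layers:
            self._fold(layer, layer["X"], i, u)

    def alarm(self, reported):
        """Alarm if, in some layer, at least gamma buckets disagree."""
        for layer in self.layers:
            Y = [1] * self.k
            for i, v in reported.items():
                self._fold(layer, Y, i, v)
            if sum(s != t for s, t in zip(layer["X"], Y)) >= self.gamma:
                return True
        return False

syn = PIRSGamma(gamma=3, p=1000003, layers=4, rng=random.Random(1))
truth = {i: i + 1 for i in range(1, 21)}
for i, v in truth.items():
    syn.update(i, v)

tampered = dict(truth)
tampered[5] += 1                                    # only two groups are wrong,
tampered[9] += 2                                    # fewer than gamma = 3
```

Because a wrong group can spoil only the one bucket it hashes to, a report with fewer than γ wrong groups can never reach γ disagreeing buckets, so it never raises an alarm in this sketch. Detection of γ or more wrong groups is probabilistic and depends on the constants.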

Page 16: PIRS: Query Verification on Data Streams

PIRSγ: An Exact Solution

Theorem: PIRSγ requires O(γ² log(1/δ) log n) bits, spends O(log(1/δ)) time to process a tuple, and solves CQV with semantic load shedding.

Page 17: PIRS: Query Verification on Data Streams

Intuition on Approximation

[Figure: probability to raise an alarm (y-axis) versus number of errors (x-axis). The ideal synopsis is a step function that jumps at γ; the approximation rises between γ- and γ+.]

Page 18: PIRS: Query Verification on Data Streams

PIRS±γ: An Approximate Solution

Theorem: PIRS±γ requires O(γ log(1/δ) log n) bits and spends O(γ log(1/δ)) time to process a tuple.

Page 19: PIRS: Query Verification on Data Streams

CQV with Random Load Shedding

Randomly drop tuples.

All groups have small errors.

Goal: detect whether any group has an error greater than a claimed threshold.

Theorem: Any synopsis that solves this problem with error probability at most δ requires Ω(n) bits (by reduction from the problem of estimating the infinite frequency moment, i.e., the number of occurrences of the most frequent item).

Page 20: PIRS: Query Verification on Data Streams

Sliding Window and Other Queries

It is easy to extend PIRS to work with the sliding window model since it is decomposable, i.e., X(v1+v2) = X(v1)·X(v2).

Other queries can be transformed into group-by aggregation queries.

Details are in the paper.
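Decomposability can be checked numerically, and it suggests one possible way to slide a window: divide out the synopsis of the expired tuples with a modular inverse. The slides do not spell out the sliding-window mechanism, so the inverse-based step below is an assumption, as are the demo values; it only works when the expired part's synopsis is nonzero mod p. The negative exponent in `pow` needs Python 3.8 or later.

```python
def pirs(vec, a, p):
    """X(vec) = prod_i (a - i)^(vec_i) mod p."""
    x = 1
    for i, v in vec.items():
        x = x * pow(a - i, v, p) % p
    return x

p, a = 2003, 1234                        # assumed demo values (p prime, a in Z_p)
v1 = {1: 9, 4: 7}                        # contribution of tuples that expired
v2 = {1: 1, 2: 5}                        # contribution of tuples still in the window
total = {1: 10, 2: 5, 4: 7}              # v1 + v2, group by group

x_total = pirs(total, a, p)
x_product = pirs(v1, a, p) * pirs(v2, a, p) % p      # X(v1) * X(v2)

# Assumed window step: remove the expired part via its modular inverse.
x_window = x_total * pow(pirs(v1, a, p), -1, p) % p
```

In this sketch `x_window` equals the synopsis of `v2` alone, which is what a client restricted to the current window would hold.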

Page 21: PIRS: Query Verification on Data Streams

Some Experiments

We use real streams:
  World Cup Data (WC)
  IP traces from the AT&T network (IP)

We perform the following queries:
  WC: aggregate on response size and group by client id/object id (50M groups)
  IP: aggregate on packet size and group by source IP/destination IP (7M groups)

Hardware for the client:
  2.8GHz Intel Pentium 4 CPU
  512 MB memory
  Linux machine

Page 22: PIRS: Query Verification on Data Streams

Detection Accuracy

n is determined by the number of potential groups, not the actual number of groups:

$n = 2^{64} \approx 2 \times 10^{19}$, $m = 10^{10}$, hence $\delta = m/p \approx 0.5 \times 10^{-9}$

Over 100,000 random attacks, PIRS identifies all of them.
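The arithmetic behind the error probability can be reproduced directly, assuming the slide's figures are n = 2^64 potential (source IP, destination IP) pairs and m = 10^10, and that p is chosen just above n because n dominates m/δ. These readings of the garbled formula are assumptions.

```python
n = 2 ** 64              # potential groups: 2^32 source IPs times 2^32 destination IPs
m = 10 ** 10             # assumed bound on the sum of all group values
p_lower = n              # p must exceed max(m/delta, n); here n is the larger term

# With p just above n, the failure probability m/p is at most m/n.
delta_bound = m / p_lower
```

The result is about 5.4 × 10^-10, in line with the slide's rounded 0.5 × 10^-9, which is consistent with no attack going undetected in 100,000 trials.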

Page 23: PIRS: Query Verification on Data Streams

Memory Usage of Exact

PIRS uses only a constant 3 words (27 bytes) at all times.

Exact’s memory usage is linear and expensive.

Page 24: PIRS: Query Verification on Data Streams

Update Time (per tuple) of Exact

1. Exact is fast when memory usage is small.
2. It becomes extremely slow due to cache misses and memory swap operations.

[Figure: update time per tuple of Exact; the slow region is annotated "cache misses and memory swap".]

Page 25: PIRS: Query Verification on Data Streams

Running Time Analysis

Average Update Time

         WC         IPs
Count    0.98 μs    0.98 μs
Sum      8.01 μs    6.69 μs

IPs exhibits a smaller update cost for the sum query because its average value of u is smaller than that of WC.

Page 26: PIRS: Query Verification on Data Streams

Multiple Queries: Exact Memory Usage

PIRS always uses only a constant 3 words (27 bytes).

Exact’s memory usage is linear w.r.t. the number of queries and increases over time.

Page 27: PIRS: Query Verification on Data Streams

Multiple Queries: Exact Update Time Per Tuple


Page 28: PIRS: Query Verification on Data Streams

Multiple Queries: PIRS Update Time Per Tuple


Page 29: PIRS: Query Verification on Data Streams

The Library


Download PIRS and other synopses at:

http://www.cs.fsu.edu/~lifeifei/pirs/

Page 30: PIRS: Query Verification on Data Streams

Conclusion

A space- and update-efficient synopsis for verifying continuous group-by aggregation queries on streaming data.

It can be generalized to handle selection queries and sliding-window semantics.

How about more complicated queries?

Page 31: PIRS: Query Verification on Data Streams

Thanks!


Questions

Page 32: PIRS: Query Verification on Data Streams

Problem and Goals

Assumption: the client and the DSMS observe the same stream.

Problem: the client needs to verify the results.

Goals:
  Be memory and update efficient
  Tolerate a limited number of errors
  Tolerate small errors
  Support multiple queries

Page 33: PIRS: Query Verification on Data Streams

Related Techniques to PIRS

Incremental Cryptography
  Block operations (insert, delete); cannot support arithmetic operations

Program Verification
  The server may pass the program execution but simply return random outputs

Fingerprinting Technique
  PIRS is a fingerprinting technique

Page 34: PIRS: Query Verification on Data Streams

CQV with Semantic Load Shedding

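$E(V, \hat{V}) = |\{\, i \mid v_i \ne \hat{v}_i \,\}|$

$\hat{V} \ne_\gamma V$ iff $E(V, \hat{V}) \ge \gamma$

$\hat{V} =_\gamma V$ iff $E(V, \hat{V}) < \gamma$

Design a synopsis s.t. it raises an alarm with probability at least 1-δ if $\hat{V} \ne_\gamma V$, and raises no alarm if $\hat{V} =_\gamma V$.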

Page 35: PIRS: Query Verification on Data Streams

PIRS±γ: An Approximate Solution

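Theorem: PIRS±γ
1. raises no alarm with probability at least 1-δ on any $\hat{V} =_{\gamma^-} V$, where $\gamma^- = (1 - \frac{c}{\ln\gamma})\gamma$;
2. raises an alarm with probability at least 1-δ on any $\hat{V} \ne_{\gamma^+} V$, where $\gamma^+ = (1 + \frac{c}{\ln\gamma})\gamma$,

for any c > -ln ln 2 ≈ 0.367.

The proof uses the intuition of the coupon collector problem and the Chernoff bound.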

Page 36: PIRS: Query Verification on Data Streams

PIRS±γ: An Approximate Solution

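Choose k s.t. $k \ln k \approx \gamma$.

$b_1, \ldots, b_n$: n-wise independent random numbers, uniformly distributed in {1, …, k}.

[Figure: one layer consists of k buckets, each holding its own PIRS; group vi with bi = 2 is assigned to bucket 2.]

A layer raises an alarm if all k buckets raise alarms.

There are log 1/δ layers; the synopsis raises an alarm if the majority of layers raise alarms.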
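A rough sketch of this construction in Python. It rests on several assumptions that are not in the slides: k is taken as the largest integer with k·ln k ≤ γ, the bucket assignment b_i is simulated by seeding Python's `random.Random` with a per-layer seed and the group id (a stand-in for truly n-wise independent numbers), and all names and demo values are invented for the example.

```python
import math
import random

def choose_k(gamma):
    """Largest k >= 1 with k * ln(k) <= gamma (assumed reading of the rule)."""
    k = 1
    while (k + 1) * math.log(k + 1) <= gamma:
        k += 1
    return k

class PIRSApprox:
    def __init__(self, gamma, p, layers, seed=0):
        rng = random.Random(seed)
        self.p, self.k = p, choose_k(gamma)
        self.layers = [{
            "seed": rng.getrandbits(64),
            "a": [rng.randrange(p) for _ in range(self.k)],   # one seed per bucket
            "X": [1] * self.k,                                 # one PIRS per bucket
        } for _ in range(layers)]

    def _fold(self, layer, acc, i, u):
        # Stand-in for b_i: a pseudo-random bucket derived from (layer seed, i).
        b = random.Random(f"{layer['seed']}:{i}").randrange(self.k)
        acc[b] = acc[b] * pow(layer["a"][b] - i, u, self.p) % self.p

    def update(self, i, u):
        for layer in self.layers:
            self._fold(layer, layer["X"], i, u)

    def alarm(self, reported):
        """Layer alarm: all k buckets disagree. Overall: majority of layers."""
        votes = 0
        for layer in self.layers:
            Y = [1] * self.k
            for i, v in reported.items():
                self._fold(layer, Y, i, v)
            if all(s != t for s, t in zip(layer["X"], Y)):
                votes += 1
        return 2 * votes > len(self.layers)

syn = PIRSApprox(gamma=10, p=1000003, layers=5)
truth = {i: i + 1 for i in range(1, 41)}
for i, v in truth.items():
    syn.update(i, v)

one_error = dict(truth)
one_error[7] += 3                        # a single wrong group
```

With γ = 10 this picks k = 5 buckets per layer. A single wrong group can spoil only one bucket, so no layer can have all buckets disagreeing and the sketch stays silent; alarms on many wrong groups are probabilistic, in the spirit of the coupon collector intuition.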

Page 37: PIRS: Query Verification on Data Streams

Information Disclosure on Multiple Attacks

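R: the space of random seeds used by PIRS

witness: $W(V, \hat{V}) = \{\, r \in R \mid \text{PIRS raises an alarm on } r \,\}$

non-witness: $\overline{W}(V, \hat{V}) = R - W(V, \hat{V})$

$|\overline{W}(V, \hat{V})| \le \delta |R|$, if $\hat{V} \ne V$

[Figure: the client keeps PIRS X(V) on a seed r drawn from R; the server returns $\hat{V}$.]

If $\hat{V} = V$: the server learns nothing about r.

If $\hat{V} \ne V$ and it received an alarm: the server learns $r \in W(V, \hat{V})$.

Insight: the server could potentially get rid of a δ portion of the seeds from each notified failed attack!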

Page 38: PIRS: Query Verification on Data Streams

Information Disclosure on Multiple Attacks

Theorem: For a total of k attacks made by Bob on PIRS, the probability that none of them succeeds is at least 1-kδ.
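One way to see the bound is a union bound over the seeds. The notation here is an assumption made for the sketch: R is the space of random seeds, $\hat{V}_j$ is the wrong answer Bob sends in attack j provided all earlier attacks raised alarms, and $\overline{W}_j \subseteq R$ is the set of seeds on which attack j goes undetected, with $|\overline{W}_j| \le \delta |R|$ by the single-attack guarantee.

```latex
\Pr[\text{some attack succeeds}]
  = \Pr_{r \in R}\Big[\, r \in \bigcup_{j=1}^{k} \overline{W}_j \Big]
  \le \sum_{j=1}^{k} \frac{|\overline{W}_j|}{|R|}
  \le k\delta
\quad\Longrightarrow\quad
\Pr[\text{no attack succeeds}] \ge 1 - k\delta .
```

The first equality holds because, as long as every earlier attack raised an alarm, the sequence of attacks is the same for every seed, so the first successful attack is exactly the first j with r in $\overline{W}_j$.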

Page 39: PIRS: Query Verification on Data Streams

Proof of the Optimality

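$F = \{f_1, f_2, \ldots\}$, where $f_i$ is chosen with probability $p(f_i)$, assuming $p(f_1) \ge p(f_2) \ge \cdots$

$f: U \to M$

X needs at least log|M| bits for the output and log|F| bits to describe the function from F.

$F_{V,\hat{V}} = \{\, f \in F \mid f(V) = f(\hat{V}) \,\}$

For X: $\sum_{f \in F_{V,\hat{V}}} p(f) \le \delta$ for any $V \ne \hat{V}$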

Page 40: PIRS: Query Verification on Data Streams

Proof of the Optimality

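$k = \lceil \delta |F| \rceil + 1$, and consider $f_1, \ldots, f_k$: $\sum_{i=1}^{k} p(f_i) > \delta$

There is a total of $|M|^k$ possible combinations for the outputs of these k functions.

By pigeonhole, $|U| \le |M|^k$, i.e., $\log|U| \le (\lceil \delta|F| \rceil + 1)\log|M|$.

With $|U| \ge 2^n$:
  if $\delta|F| \le 1$: $\log|M| = \Omega(n)$
  else: $|F| \log|M| = \Omega(n/\delta)$, so $\log|F| + \log|M| = \Omega(\log(n/\delta))$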