1 a privacy preserving index for range queries bijit hore, sharad mehrotra, gene tsudik

1

A Privacy Preserving Index for Range Queries

Bijit Hore, Sharad Mehrotra, Gene Tsudik

2

Database as a Service (DAS) [Hacigumus et al., SIGMOD 2002]

A client wants to store data on a remote server and run queries on it

BUT the client does not trust the server

Solution: encrypt the data and store it

How do you query the encrypted data?

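[Figure: DAS architecture. On the trusted client side, the user's original query passes through a query translator, which sends a query over encrypted data to the untrusted server (service provider). The server stores the client data encrypted and indexed, and returns encrypted results. A client-side query post processor turns them into the true results.]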

3

Data storage in DAS

Original Table (plain text) R

eid   name    addr    shares   age   sal
345   Tom     Maple   5400     32    390K
876   Mary    Main    5800     22    423K
234   John    River   6000     34    598K
780   Jerry   Ocean   6200     48    632K

Server side Table (encrypted + indexed) RA

etuple      sharesA   ageA   salA
X@#$^&FJ    X1        Y2     Z1
CH$^*(G#!   X2        Y1     Z1
^$*D%L*#    X3        Y2     Z2
*%GH%&)$    X3        Y3     Z3

The sharesA, ageA and salA columns hold bucket-tags.

Meta data: buckets on sal (in K)

boundaries:   0    200    450    600    650    700
bucket-tag:     Z0     Z1     Z2     Z3     Z4

Server side data: table RA
Client side storage: meta data

4

Querying in DAS

Client-side query

Select * from R where R.sal ∈ [400K, 600K]

Server-side query

Select etuple from RA where RA.salA = Z1 ∨ RA.salA = Z2

Server side Table (encrypted + indexed) RA

etuple      sharesA   ageA   salA
X@#$^&FJ    X1        Y2     Z1
CH$^*(G#!   X2        Y1     Z1
^$*D%L*#    X3        Y2     Z2
*%GH%&)$    X3        Y3     Z3

The sharesA, ageA and salA columns hold bucket-tags.

Client side Table (plain text) R

eid   name    addr    shares   age   sal
345   Tom     Maple   5400     32    390K
876   Mary    Main    5800     22    426K
234   John    River   6000     34    598K
780   Jerry   Ocean   6200     48    634K
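The query flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bucket boundaries come from the meta data slide, but treating each bucket as the half-open interval (low, high] is an assumption chosen so that the range [400K, 600K] maps to Z1 ∨ Z2 as shown. The function names are hypothetical, and encryption of the etuple is left out entirely.

```python
from bisect import bisect_left

# Client-side meta data for sal (in K): boundaries delimit buckets Z0..Z4.
BOUNDARIES = [0, 200, 450, 600, 650, 700]
TAGS = ["Z0", "Z1", "Z2", "Z3", "Z4"]


def tag_index(value):
    """Index of the bucket holding value.

    Assumption: bucket i covers (BOUNDARIES[i], BOUNDARIES[i + 1]],
    and the first bucket also contains its lower boundary.
    """
    if not BOUNDARIES[0] <= value <= BOUNDARIES[-1]:
        raise ValueError("value outside the bucketized domain")
    return max(bisect_left(BOUNDARIES, value) - 1, 0)


def translate_range(low, high):
    """Client side: rewrite the range [low, high] as a list of bucket tags."""
    return TAGS[tag_index(low):tag_index(high) + 1]


# Plain-text rows (name, sal in K) as listed on this slide.
R = [("Tom", 390), ("Mary", 426), ("John", 598), ("Jerry", 634)]

# Server-side view: an opaque row handle plus the sal bucket tag.
# A real etuple would be the encrypted row; that step is omitted here.
RA = [(row_id, TAGS[tag_index(sal)]) for row_id, (_, sal) in enumerate(R)]


def run_query(low, high):
    """Translate, evaluate on the server, then post-process on the client."""
    wanted = set(translate_range(low, high))
    candidates = [row_id for row_id, tag in RA if tag in wanted]  # server side
    # Client side: "decrypt" the candidates and drop the false positives.
    return [R[row_id][0] for row_id in candidates if low <= R[row_id][1] <= high]
```

With these rows, `translate_range(400, 600)` gives the two tags of the server-side query, the server returns Tom, Mary and John, and the client-side filter drops Tom (390K) as a false positive.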

5

Issues in partitioning

How many buckets should one use?

How to partition the data?

6

Data Privacy in DAS

Adversary (A): access to server-side data + malicious intentions

Privacy issue in partitioned data: small range of a bucket B + 1 sample value from B ⇒ “almost total” disclosure of all elements in B

Privacy goal of client: to hide all useful information from A

Put all values of an attribute in a single bucket!

7

Research challenges & our contributions

Precision: how to partition data
• Definition
• Optimal partitioning to maximize precision

Privacy: quantifying disclosure
• Adversary’s goals
• Measures of information disclosure

Privacy-Precision trade-off
• Controlled diffusion algorithm

Experiments & Conclusion

8

Precision of range queries

Given a partition of data into M parts

Precision(q) = 1 - (# false positives / # tuples returned for q)

Recall = 1

Workload: all O(N^2) range queries are equiprobable (uniform)

[Figure: frequency histogram over Salary (100K's), values 1 to 10, with per-value frequencies of 4, 10, 6 and 2. N = 10 (domain size). The data is split into M = 2 buckets, one with N_B = 5, F_B = 32 and one with N_B = 5, F_B = 18. A query q is marked.]

Precision of q = 1 - 20/50 = 0.6

# false positives ∝ ∑_B N_B*F_B = 5*32 + 5*18 = 250
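The two quantities on this slide are easy to check numerically. The sketch below is an illustration, not the authors' code. The frequency list is an assumption: its bucket totals (32 and 18) match the slide, but the exact order of the frequencies for values 6 to 10 cannot be read off the extracted text. The query [4, 7] is likewise a hypothetical choice that happens to reproduce the 0.6 shown.

```python
# Frequencies for salary values 1..10 (in 100K's).
# Assumption: the order of the last five entries is a guess; only the
# bucket totals are taken from the slide.
FREQ = [4, 4, 4, 10, 10, 6, 4, 2, 2, 4]


def bucket_cost(freq, buckets):
    """Sum of N_B * F_B over buckets given as 1-based (first, last) ranges."""
    return sum((hi - lo + 1) * sum(freq[lo - 1:hi]) for lo, hi in buckets)


def precision(freq, buckets, q_lo, q_hi):
    """Precision of the range query [q_lo, q_hi] when whole buckets are returned."""
    returned = sum(
        sum(freq[lo - 1:hi])
        for lo, hi in buckets
        if lo <= q_hi and hi >= q_lo  # bucket overlaps the query range
    )
    if returned == 0:
        return 1.0  # nothing returned, so no false positives
    matching = sum(freq[q_lo - 1:q_hi])
    return matching / returned  # same as 1 - false positives / returned


TWO_BUCKETS = [(1, 5), (6, 10)]
print(bucket_cost(FREQ, TWO_BUCKETS))      # 5*32 + 5*18
print(precision(FREQ, TWO_BUCKETS, 4, 7))  # 30 matching tuples out of 50 returned
```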

9

Query optimal buckets (QOB)

Optimization problem: for the uniform workload, find a partition of the data into M buckets that minimizes the total # false positives, i.e.

Minimize ∑_{B=1..M} N_B*F_B

[Figure: frequency histogram over Salary (100K's), values 1 to 10; N = 10 (domain size). The rightmost bucket covers values 8 to 10.]

QOB(1,10,4) = QOB(1,7,3) + Cost(8,10)

QOB(1,7,3): optimal solution to a sub-problem

Cost(8,10): cost of rightmost bucket = N_B*F_B = 24

10

QOB (cont.)

[Figure: frequency histogram over Salary (100K's), values 1 to 10, partitioned into four buckets B1, B2, B3 and B4.]

Optimal cost = ∑ N_B*F_B = 12*3 + 20*2 + 10*2 + 8*3 = 120

Time complexity = O(n^2 M), Space = O(nM)

n = # distinct values in dataset; M = # buckets
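The recurrence behind QOB can be written as a small dynamic program. What follows is a minimal sketch under assumptions, not the authors' implementation: the function name `qob` and the table layout are hypothetical, and the frequency list uses an assumed order for values 6 to 10 (only the bucket totals are recoverable from the slides). The triple loop matches the stated O(n^2 M) time and the two tables the O(nM) space.

```python
def qob(freq, M):
    """Partition values 1..n into M contiguous buckets minimizing sum N_B * F_B.

    Returns (optimal cost, list of 1-based (first, last) bucket ranges).
    Assumes 1 <= M <= len(freq).
    """
    n = len(freq)
    prefix = [0]
    for f in freq:
        prefix.append(prefix[-1] + f)

    def cost(i, j):
        # Cost of one bucket covering values i..j: N_B * F_B.
        return (j - i + 1) * (prefix[j] - prefix[i - 1])

    inf = float("inf")
    # best[m][j]: optimal cost of covering values 1..j with m buckets.
    best = [[inf] * (n + 1) for _ in range(M + 1)]
    cut = [[0] * (n + 1) for _ in range(M + 1)]
    best[0][0] = 0
    for m in range(1, M + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):  # rightmost bucket is i+1..j
                c = best[m - 1][i] + cost(i + 1, j)
                if c < best[m][j]:
                    best[m][j], cut[m][j] = c, i

    # Walk the cut points backwards to recover the bucket ranges.
    buckets, j = [], n
    for m in range(M, 0, -1):
        i = cut[m][j]
        buckets.append((i + 1, j))
        j = i
    return best[M][n], buckets[::-1]


# Assumption: order of the last five frequencies is a guess.
FREQ = [4, 4, 4, 10, 10, 6, 4, 2, 2, 4]
print(qob(FREQ, 4))
```

Under the assumed frequencies, the program picks the buckets 1 to 3, 4 to 5, 6 to 7 and 8 to 10, whose costs are the four terms 12*3, 20*2, 10*2 and 8*3.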

11

Outline

Optimal data partitioning for range queries

Adversarial goals & privacy measures

Balancing privacy and precision

Experiments & conclusion

12

Adversary’s learning model

A needs to learn bucket properties to estimate sensitive values

Model: A’s domain knowledge + sample values from buckets ⇒ A learns the distribution of values in buckets

Worst-case assumption for privacy analysis: A knows the exact value distribution for every bucket

13

Adversarial Goal (I)

Individual Centric Information: E.g. “What is the salary of an individual I?”

Value Estimation Power (VEP) of A

Variance of bucket-distribution is an inverse measure of VEP

[Figure: two bucket ranges compared by the adversary's average error of value estimation. Small variance: small error. Large variance: large error. Preferred: large variance.]

14

Adversarial Goal (II)

Query Centric Information: E.g. “Which individuals have salary in [100k, 150k]?”

Set Estimation Power (SEP) of A

Entropy of bucket-distribution is an inverse measure of SEP*

H(X) = - ∑ p_i log p_i

[Figure: two bucket distributions over the range 100k to 150k, compared by the adversary's average error of query-set estimation. Low entropy + large variance: small error. High entropy + large variance: large error (best case).]
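Both privacy measures are simple functions of a bucket's value distribution. The sketch below computes them for two hypothetical buckets over the 100k to 150k range. The example buckets and the function names are made up for illustration, and base-2 logarithms are an assumption because the slide does not state the base.

```python
from math import log2


def distribution(freq_by_value):
    """Turn a {value: frequency} bucket into a {value: probability} map."""
    total = sum(freq_by_value.values())
    return {v: f / total for v, f in freq_by_value.items()}


def variance(dist):
    """Variance of the bucket distribution (inverse measure of VEP)."""
    mean = sum(v * p for v, p in dist.items())
    return sum(p * (v - mean) ** 2 for v, p in dist.items())


def entropy(dist):
    """H(X) = -sum p_i log2 p_i (inverse measure of SEP); base 2 is assumed."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


# Hypothetical buckets (salaries in K): same range, different shapes.
two_spikes = distribution({100: 5, 150: 5})
spread_out = distribution({100: 2, 110: 2, 120: 2, 130: 2, 140: 2, 150: 2})

for name, dist in [("two spikes", two_spikes), ("spread out", spread_out)]:
    print(name, variance(dist), entropy(dist))
```

Both buckets have a large variance, but the one with only two distinct values has much lower entropy, which is the "low entropy + large variance" case the slide warns about.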

16

Privacy-Precision Trade-off

Optimal buckets might offer less privacy than desired
• Small variance ⇒ partial disclosure of numeric value
• Small entropy ⇒ total disclosure with high probability (e.g. categorical data)
• Partial detection of query-sets (for all cases)

Objective: an algorithm that allows trading off a bounded amount of query precision for greater variance and entropy

17

The controlled diffusion algorithm

A simple observation

[Figure: a query Q over an optimal bucket B0, whose elements are spread into composite buckets CB1, CB2 and CB3.]

• Let a query Q overlap only with B0

• If elements of B0 are distributed into CB1, CB2 & CB3 randomly

• Now Q overlaps with CB1, CB2 & CB3

• With new buckets, the precision for Q drops by a factor of (|CB1|+|CB2|+|CB3|) / |B0|

For any re-distribution scheme where this ratio is ≤ K for every bucket B_i, precision degradation is bounded above by K

18

Controlled diffusion algorithm

• Compute optimal buckets B1 … BM on data set D

• Fix max degradation factor = K

• Initialize M empty composite buckets CB1 … CBM

• Set target size of each CB to f_CB = |D|/M (equidepth)

• ∀ B_i select d_i CBs at random, where d_i = K*|B_i|/f_CB

• Diffuse elements of B_i into these CBs uniformly at random
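The steps above can be sketched as follows. This is a hedged illustration rather than the authors' implementation, and it makes several assumptions that the slide leaves open: d_i is rounded to the nearest integer and clamped to the range 1..M, the equidepth target f_CB is used only to compute d_i (no capacity limit is enforced on a composite bucket), and the `seed` parameter exists only to make the run repeatable. The example buckets reuse an assumed order of frequencies within each optimal bucket.

```python
import random


def controlled_diffusion(buckets, K, seed=0):
    """Diffuse optimal buckets into M composite buckets.

    buckets: list of lists of values (the optimal buckets B1..BM).
    Returns (composite buckets, per-bucket list of chosen CB indexes).
    """
    rng = random.Random(seed)
    M = len(buckets)
    f_cb = sum(len(b) for b in buckets) / M  # equidepth target size |D|/M
    composite = [[] for _ in range(M)]
    fanout = []
    for b in buckets:
        # Assumption: round d_i = K*|B_i|/f_CB and keep it within 1..M.
        d = min(M, max(1, round(K * len(b) / f_cb)))
        targets = rng.sample(range(M), d)  # d_i distinct CBs chosen at random
        for value in b:
            # Each element goes to one of the chosen CBs uniformly at random.
            composite[rng.choice(targets)].append(value)
        fanout.append(sorted(targets))
    return composite, fanout


# Four optimal buckets over values 1..10 (sizes 12, 20, 10, 8).
EXAMPLE_BUCKETS = [
    [1] * 4 + [2] * 4 + [3] * 4,
    [4] * 10 + [5] * 10,
    [6] * 6 + [7] * 4,
    [8] * 2 + [9] * 2 + [10] * 4,
]

cbs, chosen = controlled_diffusion(EXAMPLE_BUCKETS, K=2, seed=1)
for index, targets in enumerate(chosen, start=1):
    print(f"B{index} diffused into CBs {targets}")
```

With K = 2 the four buckets get 2, 3, 2 and 1 composite buckets respectively, 8 = K*M in total, which is where the O(KM) meta data size comes from.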

19

Controlled Diffusion (Example)

Degradation factor K = 2

f(CB) = 50/4 = 12.5

K*|f(B)| / |f(CB)| = 2*12/12.5 ≈ 2

[Figure: the query optimal buckets B1 to B4 over values 1 to 10 (per-value frequencies of 4, 10, 6 and 2) are diffused into composite buckets CB1 to CB4, which form the final set of buckets on the server.]

Metadata size increases from O(M) to O(KM)

20

Some features of the diffusion algorithm

Many consecutive optimal buckets might get diffused into a common set of CBs
• Observed precision degradation < K

Elements with the same value can go to multiple buckets
• Gives it an extra degree of freedom compared to hashing
• Not best for point queries

Random choice in the algorithm
• Each bucket distribution approaches the data distribution as K increases, reducing the information the adversary gains by learning bucket distributions

22

Experiments

Data sets
• Synthetic data: 10^5 integers in [0, 999], chosen uniformly at random
• Real data: 10^4 real values in [-0.8, 8.0] from the “Corel Image” dataset (UCI KDD archive)

Query workloads (2 of size 10^4 each)
• End points chosen uniformly at random from the respective ranges

23

[Figure: three plots on the synthetic data. Each has Number of Buckets (100 to 400) on the x-axis and one curve per k = 2, 4, 6, 8, 10.
 Ratio of Average Precision (synthetic): Prec(QOB) / Prec(CB's)
 Ratio of Average Std Deviation (synthetic): StdDev(CB) / StdDev(QOB)
 Ratio of Average Entropy (synthetic): Entropy(CB) / Entropy(QOB)]

1. Relative decrease in precision of composite buckets

2. Relative increase in standard deviation in composite buckets

3. Relative increase in entropy in composite buckets

24

Composite buckets (sample)

[Figure: two frequency histograms (Bin vs. Frequency) of sample composite buckets, one for K = 6, M = 350 and one for K = 10, M = 250.]

25

[Figure: two plots with Average Precision (0 to 1.2) on the x-axis, showing Opt-Buckets and CB (k = 2, 4, 6, 8, 10).
 Trade-off (Precision vs Entropy): Average Entropy on the y-axis
 Trade-off (Precision vs Std. Dev): Average Std. Deviation on the y-axis]

• Visualizing trade-offs for various bucketization parameters

• E.g. the marked points show the average entropy & precision we get for 100 buckets & a degradation factor of 2

• The same point is marked in the precision vs standard deviation trade-off space

• Provides an easy way to visualize the design space and choose parameters of interest

26

Summary

An optimal algorithm for partitioning data for range queries

Statistical measures of data privacy
• Variance
• Entropy

Fast & simple algorithm for re-bucketizing data
• Bounded amount of precision degradation
• Substantial increase in privacy level

27

Related work

Hacigumus et al., SIGMOD 2002, “Executing SQL over Encrypted Data in the Database Service Provider Model”.

Damiani et al., ACM CCS 2003, “Balancing Confidentiality and Efficiency in Untrusted Relational DBMSs”.

Bouganim et al., VLDB 2002, “Chip-Secured Data Access: Confidential Data on Untrusted Servers”.

28

THANK YOU !

Questions ?

29

Privacy in DAS

Here the goal of “Data Privacy” is not just ensuring “non-disclosure of identity”. It is more general!

Privacy-preserving DM & Statistical DB

• Privacy criteria: Protect against disclosure of identity

• Utility criteria: Minimize information loss, i.e. maximize utility for data miners and retain as much aggregate-level information as possible

DAS

• Privacy criteria: Hide as much information as possible (even at the aggregate level)

• Utility criteria: Maintain only the necessary information required for server-side query evaluation (at the desired degree of accuracy)

30

Individual Privacy Measure

Average Squared Error of Estimation (ASEE): error in approximating the true value of a r.v. X_B by another r.v. X_B’ (learned by A)

ASEE(X_B, X_B’) = Var(X_B) + Var(X_B’) + (E(X_B) - E(X_B’))^2

Variance of bucket distribution, Var(X_B), is our measure of individual privacy (lower bound)
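The ASEE identity can be checked numerically on small discrete distributions. The sketch below assumes that X_B and X_B’ are independent, which the slide does not state explicitly but which makes the expected squared difference decompose into the three terms shown. The example distributions and function names are hypothetical.

```python
from itertools import product


def mean(dist):
    return sum(v * p for v, p in dist.items())


def variance(dist):
    mu = mean(dist)
    return sum(p * (v - mu) ** 2 for v, p in dist.items())


def asee_direct(true_dist, guess_dist):
    """E[(X - X')^2] computed over the joint distribution.

    Assumption: X and X' are independent, so joint probability = p * q.
    """
    return sum(
        p * q * (x - y) ** 2
        for (x, p), (y, q) in product(true_dist.items(), guess_dist.items())
    )


def asee_formula(true_dist, guess_dist):
    """Var(X) + Var(X') + (E(X) - E(X'))^2, as on the slide."""
    return (
        variance(true_dist)
        + variance(guess_dist)
        + (mean(true_dist) - mean(guess_dist)) ** 2
    )


# Hypothetical bucket and two adversarial estimates of it.
true_bucket = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
rough_guess = {2: 0.5, 3: 0.5}
best_guess = {2: 1.0}  # always answer the true mean

print(asee_direct(true_bucket, rough_guess), asee_formula(true_bucket, rough_guess))
print(asee_direct(true_bucket, best_guess), variance(true_bucket))
```

When the adversary always answers the true mean, the last two terms vanish and the error equals Var(X_B), which is why the bucket variance acts as a lower bound.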

31

Set oriented Privacy Measure

Entropy of bucket distribution is our measure for query-centric privacy

Measures the uncertainty associated with a r.v. (e.g. the true class of an element for categorical data)

An inverse measure of the quality of partial solution sets* that A can derive for a query

H(X) = - ∑ p_i log p_i

32

Meta data size increase in diffusion

The meta data increases from O(M) to

K*|B1|/f_CB + K*|B2|/f_CB + … + K*|BM|/f_CB

= (K/f_CB) * (|B1| + |B2| + … + |BM|)

= (KM/|D|) * |D| = KM = O(KM)
