CS573 Data Privacy and Security Statistical Databases Li Xiong

Upload: cleopatra-barrett

Post on 16-Jan-2016


Page 1: CS573 Data Privacy and Security Statistical Databases Li Xiong

CS573 Data Privacy and Security

Statistical Databases

Li Xiong

Page 2: CS573 Data Privacy and Security Statistical Databases Li Xiong

Today

• Statistical databases
  – Definitions
  – Early query restriction methods
  – Output perturbation and differential privacy

Page 3: CS573 Data Privacy and Security Statistical Databases Li Xiong

Statistical Data Release

Age   City      Diagnosis
25    Lilburn   mantle cell lymphoma
35    Decatur   adult T-cell lymphoma
35    Lilburn   adult T-cell lymphoma

[Figure: population counts over Diagnosis, Age (ticks 20 to 50), and City]

• Release statistical summary of the data (vs. individual records)
• Useful for analysis and learning
  – Medical statistics
  – Query log statistics (frequent search terms)

• Still need rigorous inference control

Page 4: CS573 Data Privacy and Security Statistical Databases Li Xiong

Statistical Database

• A statistical database is a database that provides statistics on subsets of records
• Statistics may include SUM, MEAN, MEDIAN, COUNT, MAX, and MIN over records
• Inference control is needed to prevent inference from statistics to individual records

Page 5: CS573 Data Privacy and Security Statistical Databases Li Xiong

Methods
• Data perturbation/anonymization
• Query restriction
• Output perturbation

Page 6: CS573 Data Privacy and Security Statistical Databases Li Xiong

Data Perturbation

[Figure: noise is added to the original database to produce a perturbed database; queries from User 1 and User 2 are answered with results from the perturbed database]

Page 7: CS573 Data Privacy and Security Statistical Databases Li Xiong

Query Restriction

[Figure: Query 1 and Query 2 are issued against the original database; query results are released only under restrictions on the query set, marked K at both ends of the allowed range]

Page 8: CS573 Data Privacy and Security Statistical Databases Li Xiong

Output Perturbation

[Figure: queries from User 1 and User 2 are evaluated on the original database; noise is added to the query results before they are returned]

Page 9: CS573 Data Privacy and Security Statistical Databases Li Xiong

Methods
• Data perturbation/anonymization
• Query restriction
  – Query set size control
  – Query set overlap control
  – Query auditing
• Output perturbation

Page 10: CS573 Data Privacy and Security Statistical Databases Li Xiong

Query Set Size Control

• A query-set size control limits the number of records that must be in the result set
• Allows the query results to be displayed only if the size of the query set |C| satisfies the condition
    K <= |C| <= L - K
  where L is the size of the database and K is a parameter that satisfies 0 <= K <= L/2
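The size condition above can be sketched as a single check. This is a minimal illustration, not a production inference-control module; the function name `size_control_allows` and the example sizes are made up for this sketch.

```python
def size_control_allows(query_set_size: int, db_size: int, k: int) -> bool:
    """Return True if a query whose query set has this size may be answered.

    Implements K <= |C| <= L - K, with L = db_size and K = k.
    """
    if not 0 <= k <= db_size / 2:
        raise ValueError("K must satisfy 0 <= K <= L/2")
    return k <= query_set_size <= db_size - k


# Hypothetical example: L = 100 records, K = 5.
print(size_control_allows(1, 100, 5))    # query set too small
print(size_control_allows(50, 100, 5))   # within the allowed range
print(size_control_allows(97, 100, 5))   # too large: the complement is too small
```

The upper bound matters because a query set of size L - 1 reveals the one excluded record once the total over all L records is known.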

Page 11: CS573 Data Privacy and Security Statistical Databases Li Xiong

Query Set Size Control

[Figure: Query 1 and Query 2 are issued against the original database; results are released only when the query set size lies between K and L - K]

Page 12: CS573 Data Privacy and Security Statistical Databases Li Xiong

Tracker

• Q1: Count ( Sex = Female ) = A
• Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B

What if B = A+1?

Page 13: CS573 Data Privacy and Security Statistical Databases Li Xiong

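Tracker

• Q1: Count ( Sex = Female ) = A
• Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B

If B = A+1, exactly one record matches (Age = 42 & Sex = Male & Employer = ABC)

• Q3: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) & Diagnosis = Schizophrenia)

If Q3 = A+1 the individual has the diagnosis; if Q3 = A the individual does not.

Positively or negatively compromised!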
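The tracker idea can be replayed on a toy table. The six records below are invented for illustration, and the `count` helper with K = 2 is an assumed stand-in for a database that enforces query-set size control. The direct query about the target is refused, yet padding it with the large "Sex = Female" set gets every query past the size check.

```python
# Made-up toy records; not data from the slides.
records = [
    {"sex": "F", "age": 30, "employer": "XYZ", "diagnosis": "Flu"},
    {"sex": "F", "age": 42, "employer": "ABC", "diagnosis": "Asthma"},
    {"sex": "F", "age": 55, "employer": "ABC", "diagnosis": "Flu"},
    {"sex": "M", "age": 42, "employer": "ABC", "diagnosis": "Schizophrenia"},
    {"sex": "M", "age": 42, "employer": "XYZ", "diagnosis": "Flu"},
    {"sex": "M", "age": 60, "employer": "ABC", "diagnosis": "Asthma"},
]
K = 2  # size control: answer only if K <= |C| <= L - K


def count(pred):
    """COUNT query guarded by query-set size control; None means refused."""
    size = sum(1 for r in records if pred(r))
    if not K <= size <= len(records) - K:
        return None
    return size


def is_female(r):
    return r["sex"] == "F"


def is_target(r):
    return r["sex"] == "M" and r["age"] == 42 and r["employer"] == "ABC"


direct = count(is_target)                                   # refused: |C| = 1
a = count(is_female)                                        # Q1
b = count(lambda r: is_female(r) or is_target(r))           # Q2
q3 = count(lambda r: is_female(r)
           or (is_target(r) and r["diagnosis"] == "Schizophrenia"))  # Q3

uniquely_identified = (b == a + 1)
has_diagnosis = uniquely_identified and (q3 == a + 1)
print(direct, a, b, q3, has_diagnosis)
```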

Page 14: CS573 Data Privacy and Security Statistical Databases Li Xiong

Query set size control

• If the threshold value k is large, then it will restrict too many queries
  – And still does not guarantee protection from compromise
• The database can be easily compromised within a frame of 4-5 queries

Page 15: CS573 Data Privacy and Security Statistical Databases Li Xiong

Query Set Overlap Control

• Basic idea: successive queries must be checked against the number of common records.
• If the number of records a new query has in common with any earlier query exceeds a given threshold, the requested statistic is not released.
• A query q(C) is only allowed if
    | q(C) ∩ q(D) | ≤ r, r > 0
  for every previously answered query q(D), where r is set by the administrator
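A minimal sketch of the overlap check, assuming each query set is available as a set of record ids. The class name `OverlapControl` and its `allow` method are hypothetical names chosen for this example.

```python
class OverlapControl:
    """Release a statistic only if its query set shares at most r records
    with every previously answered query set."""

    def __init__(self, r: int):
        if r <= 0:
            raise ValueError("r must be positive")
        self.r = r
        self.history = []  # query sets (frozensets of record ids) already answered

    def allow(self, query_set) -> bool:
        q = frozenset(query_set)
        # Compare against every earlier query: this is the per-query overhead.
        if any(len(q & prev) > self.r for prev in self.history):
            return False
        self.history.append(q)
        return True


control = OverlapControl(r=1)
print(control.allow({1, 2, 3}))   # first query, nothing to compare against
print(control.allow({3, 4, 5}))   # shares one record with the first query
print(control.allow({1, 2, 6}))   # shares two records with the first query
```

The history grows with every answered query, and a separate history per user does nothing against users who pool their answers.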

Page 16: CS573 Data Privacy and Security Statistical Databases Li Xiong

Query-set-overlap control

• Statistics for a set and its subset cannot be released – limiting usefulness

• High processing overhead – every new query compared with all previous ones

• Multiple users - need to keep user profile, need to consider collusion between users

• Still no formal privacy guarantee

Page 17: CS573 Data Privacy and Security Statistical Databases Li Xiong

Auditing

• Keeping up-to-date logs of all queries made by each user and checking for possible compromise when a new query is issued

• Excessive computation and storage requirements

• Only “efficient” methods for special types of queries

Page 18: CS573 Data Privacy and Security Statistical Databases Li Xiong

Audit Expert (Chin 1982)

• Query auditing method for SUM queries
• A SUM query can be considered as a linear equation
    $\sum_{i=1}^{L} a_i x_i = q$
  where $a_i \in \{0, 1\}$ indicates whether record i belongs to the query set, $x_i$ is the sensitive value, and q is the query result
• A set of SUM queries can be thought of as a system of linear equations
• Maintains the binary matrix representing linearly independent queries and updates it when a new query is issued
• A row with all 0s except for the ith column indicates disclosure of $x_i$
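The linear-algebra view can be sketched with plain Gaussian elimination. This is not Chin's actual data structure: it recomputes a reduced row echelon form with exact rationals on every query, and `SumAuditor`, `rref`, and `disclosed_records` are names invented for the sketch. A value x_i is determined by the answered queries exactly when the reduced form contains a row whose only nonzero entry is in column i.

```python
from fractions import Fraction


def rref(rows):
    """Reduced row echelon form over the rationals; zero rows are dropped."""
    m = [[Fraction(v) for v in row] for row in rows]
    ncols = len(m[0]) if m else 0
    pivot_row = 0
    for col in range(ncols):
        pr = next((r for r in range(pivot_row, len(m)) if m[r][col] != 0), None)
        if pr is None:
            continue
        m[pivot_row], m[pr] = m[pr], m[pivot_row]
        pivot = m[pivot_row][col]
        m[pivot_row] = [v / pivot for v in m[pivot_row]]
        for r in range(len(m)):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
    return [row for row in m if any(v != 0 for v in row)]


def disclosed_records(rows):
    """Indices i for which the query system determines x_i exactly."""
    found = []
    for row in rref(rows):
        nonzero = [i for i, v in enumerate(row) if v != 0]
        if len(nonzero) == 1:
            found.append(nonzero[0])
    return found


class SumAuditor:
    """Deny a SUM query if answering it would pin down a single x_i."""

    def __init__(self):
        self.answered = []  # reduced, linearly independent rows only

    def ask(self, indicator) -> bool:
        candidate = self.answered + [list(indicator)]
        if disclosed_records(candidate):
            return False
        self.answered = rref(candidate)
        return True


auditor = SumAuditor()
print(auditor.ask([1, 1, 1]))  # sum over all three records
print(auditor.ask([1, 1, 0]))  # the difference to the first query would isolate x_3
```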

Page 19: CS573 Data Privacy and Security Statistical Databases Li Xiong

Audit Expert

• Only stores linearly independent queries

• Not all queries are linearly independent (here Q1 = Q2 + Q3)
  Q1: Sum(Sex=M)
  Q2: Sum(Sex=M AND Age>20)
  Q3: Sum(Sex=M AND Age<=20)

Page 20: CS573 Data Privacy and Security Statistical Databases Li Xiong

Audit Expert

• O(L^2) time complexity
• Further work reduced to O(L) time and space when number of queries < L
• Only for SUM queries

Page 21: CS573 Data Privacy and Security Statistical Databases Li Xiong

Auditing – recent developments

• Online auditing
  – “Detect and deny” queries that violate privacy requirement
  – Denials themselves may implicitly disclose sensitive information
• Offline auditing
  – Check if a privacy requirement has been violated after the queries have been executed
  – Detects violations but does not prevent them

Page 22: CS573 Data Privacy and Security Statistical Databases Li Xiong

Methods
• Data perturbation/anonymization
• Query restriction
• Output perturbation
  – Differential privacy

Page 23: CS573 Data Privacy and Security Statistical Databases Li Xiong

Differential Privacy

• Differential privacy requires the outcome to be formally indistinguishable when run with and without any particular record in the data set
  – E.g.: Q = select count() where Age = [20,30] and Diagnosis = B

[Figure: output perturbation. A user issues query Q; the mechanism returns A(D1) on database D1 (Bob in) and A(D2) on database D2 (Bob out)]

Page 24: CS573 Data Privacy and Security Statistical Databases Li Xiong

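Differential Privacy

• Differential privacy: for any D1 and D2 differing in one record and any set of outputs S,
    $\Pr[A(D_1) \in S] \le e^{\alpha} \Pr[A(D_2) \in S]$
• Laplace mechanism: A(D) = Q(D) + Y, where Y is drawn from $\mathrm{Lap}(S(Q)/\alpha)$
• Query sensitivity: $S(Q) = \max_{D_1, D_2} |Q(D_1) - Q(D_2)|$

[Figure: differentially private interface. A user issues query Q; the interface returns A(D1) = Q(D1) + Y1 on D1 (Bob in) and A(D2) = Q(D2) + Y2 on D2 (Bob out)]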
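A minimal sketch of the Laplace mechanism for a count query, using only the standard library. It assumes the usual calibration in which the noise scale is the query sensitivity divided by the privacy parameter (called `alpha` here, following the later slides); the function names and the example count of 50 are illustrative.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def laplace_mechanism(true_answer: float, sensitivity: float,
                      alpha: float, rng: random.Random) -> float:
    """Return Q(D) + Y with Y ~ Lap(sensitivity / alpha)."""
    return true_answer + laplace_noise(sensitivity / alpha, rng)


rng = random.Random(0)
true_count = 50  # hypothetical answer to a COUNT query
# Adding or removing one record changes a count by at most 1: sensitivity 1.
for alpha in (0.1, 1.0):
    print(alpha, laplace_mechanism(true_count, sensitivity=1, alpha=alpha, rng=rng))
```

A smaller `alpha` means a larger noise scale, so stronger privacy is paid for with less accurate answers.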

Page 25: CS573 Data Privacy and Security Statistical Databases Li Xiong

Composition of Differential Privacy

• Sequential composition [McSherry SIGMOD 09]
  – Let each Mi provide αi-differential privacy. The sequence of Mi provides (Σi αi)-differential privacy
• Parallel composition
  – If Di are disjoint subsets of the original database and Mi provides α-differential privacy for each Di, then the sequence of Mi provides α-differential privacy

[Figure: differentially private interface. A user issues queries Q1, Q2, …; the interface returns A1(D1), A2(D1), … on D1 (Bob in) and A1(D2), A2(D2), … on D2 (Bob out)]
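The two composition rules translate into simple budget arithmetic. The sketch below assumes the standard statements: privacy parameters add up under sequential composition, and for mechanisms run on disjoint subsets the largest parameter governs. `PrivacyBudget` is a hypothetical accountant, not part of any library.

```python
def sequential_cost(alphas):
    """Total privacy cost of mechanisms run on the same data."""
    return sum(alphas)


def parallel_cost(alphas):
    """Total privacy cost of mechanisms run on disjoint subsets of the data."""
    return max(alphas)


class PrivacyBudget:
    """Tracks cumulative spending under sequential composition."""

    def __init__(self, total: float):
        self.total = total
        self.spent = 0.0

    def charge(self, alpha: float) -> None:
        if self.spent + alpha > self.total + 1e-12:
            raise ValueError("privacy budget exhausted")
        self.spent += alpha


budget = PrivacyBudget(total=1.0)
for _ in range(4):
    budget.charge(0.25)          # four queries on the same data
print(budget.spent)              # the whole budget is now used
print(sequential_cost([0.25] * 4), parallel_cost([0.25] * 4))
```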

Page 26: CS573 Data Privacy and Security Statistical Databases Li Xiong

Differential Privacy

• Is unfettered access to raw data truly essential?
• Is released data sufficient (provide sufficient utility guarantee)?

[Figure: raw data, privacy mechanism, released data, and user. The raw data is the table of individual records (Age, City, Diagnosis: 25, Lilburn, mantle cell lymphoma; 35, Decatur, adult T-cell lymphoma; 35, Lilburn, adult T-cell lymphoma); the released data is a histogram of counts over Diagnosis, Age, and City]

Page 27: CS573 Data Privacy and Security Statistical Databases Li Xiong

Challenges

• Differential privacy cost accumulates quickly with number of queries
  – Typical tasks require multiple queries or multiple steps
  – Need to support multiple users
• Impossible to guarantee utility for all (any) data or all (any) applications

Page 28: CS573 Data Privacy and Security Statistical Databases Li Xiong

Possible Middle Ground

• Guaranteed utility for certain applications
  – Counting queries, classification, logistic regression
• Guaranteed utility for certain kinds of data
  – Use prior or domain knowledge about data
  – Use intermediate results (differentially private)

[Figure: raw data, privacy mechanism, released data, and user; the privacy mechanism additionally draws on prior or domain knowledge, target applications, and intermediate results]

Page 29: CS573 Data Privacy and Security Statistical Databases Li Xiong

Our Research: Adaptive Differentially Private Data Release

• Data knowledge
  – Dense and “smooth” data
  – High dimensional and sparse data
  – Dynamic data
• Application knowledge
  – Query workload
  – Specific tasks

Page 30: CS573 Data Privacy and Security Statistical Databases Li Xiong

Histogram Example

?

Page 31: CS573 Data Privacy and Security Statistical Databases Li Xiong

Strategy I: Baseline Cell Partitioning

• Goal: to release a differentially private histogram to support random predicate queries
• Q: select count() where Age = [20,30] and Income = 40K
• If a query predicate consists of multiple cells or partitions, it will have aggregated perturbation error
• Per-cell count queries answered with alpha-DP:
  – Q1: count() where Age = 20, Diagnosis = A
  – Q2: count() where Age = 20, Diagnosis = B
  – …

[Figure: original cell histogram over Age (20, 30) and Diagnosis (A, B) with cell counts 50, 10, 50, 90, and its alpha-DP release with noisy cell counts 50’, 10’, 50’, 90’]
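The baseline strategy and its aggregated error can be sketched as follows. The four counts are the ones shown on the slide, but which count belongs to which (Age, Diagnosis) cell is an assumption made here, as are all function names. Each cell gets independent Laplace noise of scale 1/alpha, so a query covering m cells accumulates m independent noise terms.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def release_cell_histogram(cells: dict, alpha: float, rng: random.Random) -> dict:
    """Add Lap(1/alpha) noise to every cell count.

    One record falls in exactly one cell, so the whole histogram has
    sensitivity 1 and the release as a whole is alpha-differentially private.
    """
    return {cell: n + laplace_noise(1 / alpha, rng) for cell, n in cells.items()}


def answer(histogram: dict, predicate) -> float:
    """Answer a count query by summing the cells the predicate covers."""
    return sum(v for cell, v in histogram.items() if predicate(cell))


def query_noise_variance(num_cells: int, alpha: float) -> float:
    """Variance of the summed noise: num_cells * Var[Lap(1/alpha)] = m * 2 / alpha^2."""
    return num_cells * 2 / alpha ** 2


# Cell assignment is assumed for illustration: (age, diagnosis) -> count.
cells = {(20, "A"): 50, (20, "B"): 10, (30, "A"): 50, (30, "B"): 90}
noisy = release_cell_histogram(cells, alpha=0.5, rng=random.Random(1))
print(answer(noisy, lambda c: c[0] == 20))           # covers two cells
print(query_noise_variance(1, 0.5), query_noise_variance(4, 0.5))
```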

Page 32: CS573 Data Privacy and Security Statistical Databases Li Xiong

Strategy II: Hierarchical Partitioning

• Large perturbation error due to small divided privacy budget at each level

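[Figure: hierarchical release of the histogram over Age (20, 30) and Diagnosis (A, B) with original cell counts 50, 10, 50, 90. Three levels, each released with budget alpha/3: the total count 200’; the two sub-counts 60’ and 140’; the cell counts 50’, 10’, 50’, 90’]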
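As a rough check on why splitting the budget hurts, recall that Laplace noise with scale b has variance 2b². Assuming every count has sensitivity 1, a count released with the full budget alpha is compared below with a count released at one of three levels with budget alpha/3:

```latex
\[
\operatorname{Var}\!\left[\mathrm{Lap}\!\left(\tfrac{1}{\alpha}\right)\right] = \frac{2}{\alpha^{2}},
\qquad
\operatorname{Var}\!\left[\mathrm{Lap}\!\left(\tfrac{3}{\alpha}\right)\right] = \frac{18}{\alpha^{2}}
\]
```

Under these assumptions each released count is nine times noisier in variance; with h levels the factor is h².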

Page 33: CS573 Data Privacy and Security Statistical Databases Li Xiong

DPCube Strategy: Two phase partitioning

• If a query predicate is contained in a published partition, the answer has to be estimated typically based on a uniform distribution assumption. This introduces an approximation error.

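[Figure: original cell histogram over Age (20, 30) and Diagnosis (A, B) with cell counts 50, 10, 50, 90, and a published partition histogram with noisy counts 100’, 10’, 90’]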
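The uniform-distribution estimate can be written in a few lines. The helper `estimate_from_partition` and the cell labels are hypothetical; the count of 100 spread over two cells mirrors the merged partition in the figure.

```python
def estimate_from_partition(partition_count: float, partition_cells, query_cells) -> float:
    """Estimate a query answer from one published partition.

    Assumes records are spread uniformly over the partition's cells, so the
    query receives the fraction of cells it overlaps.
    """
    partition_cells = set(partition_cells)
    overlap = len(partition_cells & set(query_cells))
    return partition_count * overlap / len(partition_cells)


# A partition covering two cells is published with (noisy) count 100.
partition = [(20, "A"), (30, "A")]
print(estimate_from_partition(100, partition, [(20, "A")]))
```

If the true cell counts were 50 and 50, the estimate carries no approximation error; if they were 80 and 20, the same estimate would be off by 30 even before any noise is considered.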

Page 34: CS573 Data Privacy and Security Statistical Databases Li Xiong

DPCube Strategy: Two phase partitioning

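[Figure: two-phase partitioning over Age (20, 30) and Diagnosis (A, B). 1. Cell partitioning: the original cell counts 50, 10, 50, 90 are released as a noisy cell histogram 50’, 10’, 50’, 90’. 2. Multi-dimensional partitioning: cells are grouped into partitions, giving a partition histogram with noisy counts 100’, 10’, 90’]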

Page 35: CS573 Data Privacy and Security Statistical Databases Li Xiong

Partitioning Algorithm

• Define a uniformity (randomness) measure for a partition H(Dt)
  – information gain, variance
• Recursive algorithm Partition(Dt) for a given partition Dt
  – Find the best splitting point (e.g. largest information gain) and partition the data into Dt1 and Dt2
  – Partition(Dt1) and Partition(Dt2)
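A one-dimensional sketch of the recursive procedure, using variance as the uniformity measure. It is an illustration under simplifying assumptions, not the DPCube implementation: the cells are flattened into a list, the stopping threshold is a free parameter, and the split point is the one that minimizes the summed squared deviation within the two halves.

```python
from statistics import pvariance


def partition(counts, lo, hi, threshold, out):
    """Recursively split counts[lo:hi] until every part is near-uniform.

    Appends (lo, hi) index ranges of the final partitions to `out`.
    """
    part = counts[lo:hi]
    if hi - lo <= 1 or pvariance(part) <= threshold:
        out.append((lo, hi))
        return

    def cost(split):
        # Total squared deviation inside the two candidate halves.
        left, right = counts[lo:split], counts[split:hi]
        return pvariance(left) * len(left) + pvariance(right) * len(right)

    best = min(range(lo + 1, hi), key=cost)
    partition(counts, lo, best, threshold, out)
    partition(counts, best, hi, threshold, out)


# The four cell counts from the slides, flattened in an assumed order.
counts = [50, 50, 10, 90]
parts = []
partition(counts, 0, len(counts), threshold=0, out=parts)
print(parts)
print([sum(counts[a:b]) for a, b in parts])
```

With these counts the two equal cells end up in one partition and the other two cells stay separate, matching the 100, 10, 90 partition histogram in the figures.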

Page 36: CS573 Data Privacy and Security Statistical Databases Li Xiong

Privacy and Utility of the Released Histogram

• The released data satisfies α-differential privacy
• Support for count queries and other OLAP queries and learning tasks
• Formal utility results
  – (epsilon,delta)-usefulness
• Experimental results for partition histogram
  – CENSUS dataset, 1M tuples, 4 attributes: Age (79), Education (14), Occupation (23), and Income (100)
  – Report absolute error and relative error for random count queries
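The two error measures can be sketched as below. The exact definitions used in the experiments are not given on the slide; in particular the `sanity_bound` that keeps the relative error finite for empty or tiny query results is an assumption of this sketch.

```python
def absolute_error(true_count: float, noisy_count: float) -> float:
    return abs(noisy_count - true_count)


def relative_error(true_count: float, noisy_count: float,
                   sanity_bound: float = 1.0) -> float:
    """Absolute error scaled by the true count, floored at `sanity_bound`."""
    return abs(noisy_count - true_count) / max(true_count, sanity_bound)


# The same absolute error of 5 on a large and on a small true count.
print(absolute_error(1000, 1005), relative_error(1000, 1005))
print(absolute_error(2, 7), relative_error(2, 7))
```

On sparse data most query answers are small, so the same amount of noise translates into a much larger relative error.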

Page 37: CS573 Data Privacy and Security Statistical Databases Li Xiong

DPCube Result Example

[Figure panels: original histogram; diff. private cell histogram; diff. private partition histogram; diff. private estimated cell histogram]

Page 38: CS573 Data Privacy and Security Statistical Databases Li Xiong

Experimental Results: Comparison with other partitioning strategies

• Higher alpha (lower privacy) results in lower error (higher utility)
• The kd-tree based approach outperforms others
• Cell partitioning is comparable in absolute error but suffers in relative error due to the sparsity of the data

Page 39: CS573 Data Privacy and Security Statistical Databases Li Xiong


High dimensional sparse data

• Many real-world datasets are high dimensional and sparse
  – Web search log data, web transactions, etc.
• A direct application of the 2-phase approach
  – Cell histogram highly inaccurate
  – Computationally not scalable

Page 40: CS573 Data Privacy and Security Statistical Databases Li Xiong

Top-down recursive partitioning

• Recursively partition the spaces that have sufficient density
• Use a context free taxonomy tree
• Dynamically allocate and keep track of the budget

Page 41: CS573 Data Privacy and Security Statistical Databases Li Xiong

Adaptive Hierarchical Strategy

[Figure: the data is sparse and high dimensional. Steps: 1a. overall count; 1b. partitioning of non-sparse regions; 2a. partition count; 2b. partitioning of non-sparse regions; …; n. partition count]

Page 42: CS573 Data Privacy and Security Statistical Databases Li Xiong

Today

• Statistical databases
  – Definitions
  – Early query restriction methods
  – Output perturbation and differential privacy