publishing set-valued data via differential privacy rui chen, concordia university noman mohammed,...

38
Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia University Bipin C. Desai, Concordia University Li Xiong, Emory University 1 VLDB 2011

Upload: itzel-chantry

Post on 14-Dec-2015

218 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Publishing Set-Valued Data via Differential Privacy

Rui Chen, Concordia UniversityNoman Mohammed, Concordia UniversityBenjamin C. M. Fung, Concordia UniversityBipin C. Desai, Concordia UniversityLi Xiong, Emory University

1VLDB 2011

Page 2: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Outline

Introduction

Preliminaries

Sanitization algorithm

Experimental results

Conclusions

2

Page 3: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Introduction

• The problem: non-interactive set-valued data publication under differential privacy

3

• Typical set-valued data: transaction data, web search queries

Page 4: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Introduction

• Set-valued data refers to the data in which each record owner is associated with a set of items drawn from an item universe.

4

TID Items

t1 I1, I2, I3, I4

t2 I2, I4

t3 I2

t4 I1, I2

t5 I2

t6 I1

t7 I1, I2, I3, I4

t8 I2, I3, I4

Page 5: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Introduction

• Existing works [1, 2, 3, 4, 5, 6, 7] on publishing set-valued data are based on partitioned-based privacy models [8].

• They provide insufficient privacy protection.– Composition attack [8]– deFinetti attack [9]– Foreground knowledge attack [10]

• They are vulnerable to background knowledge.

5

Page 6: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Introduction

• Differential privacy is independent of an adversary’ background knowledge and computational power (with exceptions [11]).

• The outcome of any analysis should not overly depend on a single data record.

• Existing differentially private data publishing approaches are not adequate in terms of both utility and scalability for our problem.

6

Page 7: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Introduction

• Problems of data-independent publishing approaches:

7

Universe I = {I1, I2, I3 }

I1

I2

I3

I1, I2

I1, I3

I2, I3

I1, I2, I3

• Scalability: O(2n)• Utility: noise accumulates exponentially

Page 8: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Outline

Introduction

Preliminaries

Sanitization algorithm

Experimental results

Conclusions

8

Page 9: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Preliminaries

• Context-free taxonomy tree

9

• Each internal node is a set of their leaves, not necessarily the semantic generalization

Page 10: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Preliminaries

• Differential privacy [12]

10

A non-interactive privacy mechanism A gives ε-differential privacy if for all neighbours D, D’, and for any possible sanitized database D* Range(A),∈

PrA[A(D) = D*] ≤ exp(ε) × PrA[A(D’) = D*]

D D’

• D and D’ are neighbors if they differ on at most one record

Page 11: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Preliminaries

• Laplace mechanism [12]

11

For example, for a single counting query Q over a dataset D, returning Q(D)+Laplace(1/ε) gives ε-differential privacy.

Global Sensitivity

Page 12: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Preliminaries

• Exponential mechanism [13]

12

Given a utility function q : (D × R) → R for a database instance D, the mechanism A,

A(D, q) = { return r with probability ∝

exp(ε×q(D, r)/2 q) } △

gives ε-differential privacy.

Page 13: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Preliminaries

• Composition properties [14]

13

Sequential composition ∑iεi –differential privacy

Parallel compositionmax(εi)–differential privacy

Page 14: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Preliminaries

• Utility metrics

14

For a given itemset I’ I , a counting query Q over a dataset D is defined to be

|}':{|)( tIDtDQ

A privacy mechanism A is (α, δ)-useful if with probability 1- δ, for every counting query and every dataset D, for D*=A(D), |Q(D*)-Q(D)|<= α. [15]

Page 15: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Outline

Introduction

Preliminaries

Sanitization algorithm

Experimental results

Conclusions

15

Page 16: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Sanitization Algorithm

• Top-down partitioning

16

• Generalize all records to a single partition• Keep partitioning non-empty partitions

until leaf partitions are reached

TID Items

t1 I1, I2, I3, I4

t2 I2, I4

t3 I2

t4 I1, I2

t5 I2

t6 I1

t7 I1, I2, I3, I4

t8 I2, I3, I4

Page 17: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Sanitization Algorithm

• Privacy budget allocation

17

• We reserve B/2 to obtain noisy sizes of leaf partitions and the rest B/2 to guide the partitioning.

• Assign less budget to more general partitions and more budget to more specific partitions.

Page 18: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Sanitization Algorithm

• Privacy budget allocation

18

Example: {I{1,2}, I{3, 4}} needs at most two partition operations to reach leaf partitions

A hierarchy cut needs at most cutuii HudesInternalNo |),(|

partition operations to reach leaf partitions.

Page 19: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Sanitization Algorithm

• Privacy budget allocation

19

• We reserve B/2 to obtain noisy sizes of leaf partitions and the rest B/2 to guide the partitioning.

• Assign less budget to more general partitions and more budget to more specific partitions.

B/2/3 = B/6

(B/2-B/6)/2 = B/6

B/6+B/2 = 2B/3

Page 20: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Sanitization Algorithm

• Sub-partition generation

20

• For a non-leaf partition, we need to consider all possible sub-partitions to satisfy differential privacy.

• Efficient implementation: separately handling empty and non-empty partitions (inspired by [16]).

Page 21: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Outline

Introduction

Preliminaries

Sanitization algorithm

Experimental results

Conclusions

21

Page 22: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Experiments

• Two real-life set-valued datasets are used.

22

• MSNBC is publicly available at UCI machine learning repository(http://archive.ics.uci.edu/ml/index.html).

• STM is provided by Societe de transport de Montreal (STM) (http://www.stm.info).

Page 23: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Experiments

• Average relative error vs. privacy budget

23

B=0.5

B=0.75

B=1.0

Page 24: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Experiments

• Utility for frequent itemset mining

24

B=0.5

B=0.75

B=1.0

Page 25: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Experiments

• Scalability: O(|D|*|I|)

25

Runtime vs. |D| Runtime vs. |I|

Page 26: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Outline

Introduction

Preliminaries

Sanitization algorithm

Experimental results

Conclusions

26

Page 27: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Conclusions

• Differential privacy can be successfully applied to non-interactive set-valued data publishing with guaranteed utility.

• Differential privacy can be achieved by data-dependent solutions with improved efficiency and accuracy.

• The general idea of data-dependent solutions applies to other types of data, for example, relational data [17] and trajectory data [18].

27

Page 28: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

References

[1] J. Cao, P. Karras, C. Raissi, and K.-L. Tan. ρ–uncertainty inference proof transaction anonymization. In VLDB, pp. 1033–1044, 2010.

[2] G. Ghinita, Y. Tao, and P. Kalnis. On the anonymization of sparse high-dimensional data. In ICDE, pp. 715–724, 2008.

[3] Y. He and J. F. Naughton. Anonymization of set-valued data via top-down, local generalization. In VLDB, pp. 934–945, 2009.

[4] M. Terrovitis, N. Mamoulis, and P. Kalnis. Privacy-preserving anonymization of set-valued data. In VLDB, pp.115–125, 2008.

[5] M. Terrovitis, N. Mamoulis, and P. Kalnis. Local and global recoding methods for anonymizing set-valued data.VLDBJ, 20(1):83–106, 2011.

[6] Y. Xu, B. C. M. Fung, K. Wang, A. W. C. Fu, and J. Pei. Publishing sensitive transactions for itemset utility. In ICDM, pp. 1109–1114, 2008.

[7] Y. Xu, K. Wang, A. W. C. Fu, and P. S. Yu. Anonymizing transaction databases for publication. In SIGKDD, pp. 767–775, 2008.

28

Page 29: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

References

[8] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In SIGKDD, pp. 265-273, 2008.

[9] D. Kifer. Attacks on privacy and deFinetti’s theorem. In SIGMOD, pp. 127–138, 2009.

[10] R. C. W. Wong, A. Fu, K. Wang, P. S. Yu, and J. Pei. Can the utility of anonymized data be used for privacy breaches,” ACM Transactions on Knowledge Discovery from Data, to appear.

[11] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In SIGMOD, 2011.

[12] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284, 2006.

[13] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pp. 94–103, 2007.

[14] F. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In SIGMOD, pp. 19–30, 2009.

[15] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In STOC, pp.609–618, 2008.

29

Page 30: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

References

[16] G. Cormode, M. Procopiuc, D. Srivastava, and T. T. L. Tran. Differentially Private Publication of Sparse Data. In CoRR, 2011.

[17] N. Mohammed, R. Chen, B. C. M. Fung, and P. S. Yu. Differentially private data release for data mining. In SIGKDD, 2011.

[18] R. Chen, B. C. M. Fung, and B. C. Desai. Differentially private trajectory data publication. ICDE, under review, 2012.

30

Page 31: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Thank you!

Q & A

31

Page 32: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Backup Slides

32

Page 33: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Lower Bound Results

• In the interactive setting, only a limited number of queries could be answered; otherwise, an adversary would be able to precisely reconstruct almost the entire original database.

• In the non-interactive setting, one can only guarantee the utility of restricted classes of queries.

33

Page 34: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

34

Page 35: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

35

Page 36: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Threshold Selection

• We design the threshold as a function of the standard deviation of the noise and the height of a partition’s hierarchy cut:

36

./).(2 2 pcutpheightC

Page 37: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Relative error

• (α, δ)-usefulness is effective to give an overall estimation of utility, but fails to produce intuitive experimental results.

• We experimentally measure the utility of sanitized data for counting queries by relative error:

37

}),(max{

|)(*)(|

sDQ

DQDQ Sanity bound

Page 38: Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia

Experiments

• Average relative error vs. taxonomy tree fan-out

38

B=0.5

B=0.75

B=1.0