On the Anonymization of Sparse High-Dimensional Data

Gabriel Ghinita (1), Yufei Tao (2), Panos Kalnis (1)
(1) National University of Singapore, {ghinitag,kalnis}@comp.nus.edu.sg
(2) Chinese University of Hong Kong, [email protected]

Post on 30-Dec-2015

Page 1: On the Anonymization of Sparse High-Dimensional Data

On the Anonymization of Sparse High-Dimensional Data

Gabriel Ghinita (1), Yufei Tao (2), Panos Kalnis (1)

(1) National University of Singapore, {ghinitag,kalnis}@comp.nus.edu.sg

(2) Chinese University of Hong Kong, [email protected]

Page 2: On the Anonymization of Sparse High-Dimensional Data

Publishing Transaction Data

Publishing transaction data

Retail chain-owned shopping cart data

Infer consumer spending patterns

Correlations among purchased items

e.g., 90% of cereal buyers also buy milk

What about privacy?
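
The correlation quoted on this slide (90% of cereal buyers also buy milk) is the confidence of an association rule. A minimal sketch of that computation, assuming each transaction is a Python set of item names; the function name `confidence` and the toy carts are illustrative and do not come from the slides:

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0  # rule never fires, so report zero confidence
    hits = sum(consequent in t for t in with_antecedent)
    return hits / len(with_antecedent)


# Hypothetical shopping carts (not data from the presentation)
carts = [
    {"cereal", "milk"},
    {"cereal", "milk", "bread"},
    {"cereal"},
    {"bread", "milk"},
]
print(confidence(carts, "cereal", "milk"))
```

An anonymized release is useful to the extent that statistics like this one can still be estimated from it.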

Page 3: On the Anonymization of Sparse High-Dimensional Data

Privacy Threat

[Figure: transaction table whose columns are divided into quasi-identifying items and sensitive items]

Page 4: On the Anonymization of Sparse High-Dimensional Data

Privacy Paradigm

ℓ-diversity

Prevent association between quasi-identifier and sensitive attributes

Create groups of transactions

Frequency of a sensitive attribute (SA) value in a group < 1/p

Objective

Enforce privacy

Preserve correlations among items

Challenge: high data dimensionality
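
The group condition above can be checked mechanically. The sketch below assumes transactions are sets of item names and reads the bound as "at most 1/p", so a group of exactly p transactions may contain one occurrence of a sensitive item; under the strict reading on the slide, replace `<=` with `<`. The function name and the sample items are hypothetical.

```python
from collections import Counter


def satisfies_privacy(group, sensitive_items, p):
    """True if no sensitive item occurs in more than len(group) / p transactions."""
    counts = Counter(
        item for transaction in group for item in transaction
        if item in sensitive_items
    )
    # count / len(group) <= 1 / p, written without division
    return all(count * p <= len(group) for count in counts.values())


# Hypothetical group of three transactions with one sensitive purchase
group = [
    {"wine", "cream", "pregnancy test"},
    {"wine", "meat"},
    {"cream", "strawberries"},
]
print(satisfies_privacy(group, {"pregnancy test"}, p=3))
print(satisfies_privacy(group, {"pregnancy test"}, p=4))
```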

Page 5: On the Anonymization of Sparse High-Dimensional Data

Data Re-organization

Band Matrix Organization

PRESERVES CORRELATIONS!

Page 6: On the Anonymization of Sparse High-Dimensional Data

Published Data

Summary of Sensitive Items

Page 7: On the Anonymization of Sparse High-Dimensional Data

Contributions

Novel data representation

Preserves correlation among items

Efficient heuristic for group formation

Running time linear in the data size

Supports multiple sensitive items

Page 8: On the Anonymization of Sparse High-Dimensional Data

State-of-the-art: Mondrian [FWR06]

Generalization-based data-space partitioning, similar to k-d-trees

Split recursively for as long as the privacy condition still holds in every partition

Constrained global recoding
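
The recursive partitioning described above can be sketched as follows. This is a simplified illustration, not the algorithm of [FWR06]: it assumes numeric records, uses k-anonymity (at least k records per partition) as the privacy condition, and cuts by position at the median without the tie handling a full implementation needs. The function name and the sample (age, weight) records are hypothetical.

```python
def mondrian(records, k):
    """Split `records` (tuples of numbers) at the median of the widest dimension,
    recursing while both halves keep at least k records.
    Returns a list of (bounding_box, group_size) pairs."""
    dims = range(len(records[0]))

    def spread(d):
        values = [r[d] for r in records]
        return max(values) - min(values)

    # Try the dimension with the largest value range first
    for d in sorted(dims, key=spread, reverse=True):
        ordered = sorted(records, key=lambda r: r[d])
        mid = len(ordered) // 2
        left, right = ordered[:mid], ordered[mid:]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)

    # No allowable cut: generalize every record to the group's bounding box
    box = tuple(
        (min(r[d] for r in records), max(r[d] for r in records)) for d in dims
    )
    return [(box, len(records))]


# Hypothetical (age, weight) records
people = [(25, 50), (28, 55), (35, 70), (41, 65), (52, 90), (58, 85)]
groups = mondrian(people, k=2)
for box, size in groups:
    print(box, size)
```

Each published record is replaced by an interval per dimension. With hundreds of item dimensions, those intervals tend to cover most of each domain, which is the information loss the slide warns about.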

[Figure: Mondrian partitioning of a two-dimensional data space, Age (axis ticks 20 to 60) vs. Weight (axis ticks 40 to 100), with k = 2]

GENERALIZATION + HIGH DIMENSIONALITY = UNACCEPTABLE INFORMATION LOSS

[FWR06] K. LeFevre et al. Mondrian Multidimensional k-Anonymity. Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006.

Page 9: On the Anonymization of Sparse High-Dimensional Data

State-of-the-art: Anatomy [XT06]

Permutation-based method

Discloses exact QID values

Original table

Age  ZipCode  Disease
42   52000    Ulcer
47   43000    Pneumonia
51   32000    Flu
55   27000    Gastritis
62   41000    Dyspepsia
67   55000    Dyspepsia

"Anatomized" table

Age  ZipCode
42   52000
47   43000
51   32000
62   41000
55   27000
67   55000

Disease (count): Ulcer(1), Pneumonia(1), Flu(1), Dyspepsia(1), Gastritis(1), Dyspepsia(1)

|G|! permutations for each group G

RANDOM GROUP FORMATION DOES NOT PRESERVE CORRELATIONS

[XT06] X. Xiao and Y. Tao. Anatomy: Simple and Effective Privacy Preservation. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006.
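
To make the permutation idea concrete, the sketch below splits already-grouped records into a table of exact QID values and a per-group summary of sensitive values. It only illustrates the publishing format: the function name is hypothetical, and the grouping into pairs is an assumption made for this example, since the slide text does not state the group boundaries.

```python
from collections import Counter


def anatomize(groups):
    """Turn grouped (age, zipcode, disease) records into two tables:
    exact QID values tagged with a group id, and disease counts per group."""
    qid_table, sensitive_table = [], []
    for gid, group in enumerate(groups, start=1):
        for age, zipcode, _disease in group:
            qid_table.append((age, zipcode, gid))  # QID values are not generalized
        disease_counts = Counter(disease for _, _, disease in group)
        for disease, count in disease_counts.items():
            sensitive_table.append((gid, disease, count))
    return qid_table, sensitive_table


# Records from the slide; the pairing into groups is assumed for illustration
groups = [
    [(42, 52000, "Ulcer"), (47, 43000, "Pneumonia")],
    [(51, 32000, "Flu"), (62, 41000, "Dyspepsia")],
    [(55, 27000, "Gastritis"), (67, 55000, "Dyspepsia")],
]
qid_table, sensitive_table = anatomize(groups)
print(qid_table)
print(sensitive_table)
```

Within a group, the link between a QID row and a disease is hidden among the possible permutations, but if groups are formed at random, unrelated transactions end up together and item correlations are diluted.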

Page 10: On the Anonymization of Sparse High-Dimensional Data

Band Matrix Representation

Bandwidth = U + L + 1, where U and L are the upper and lower bandwidths

Minimizing bandwidth is NP-hard
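
The bandwidth formula can be evaluated directly on a small matrix. The sketch assumes the standard definitions, which the slide text does not spell out: U is the largest distance of a non-zero entry above the main diagonal and L the largest distance below it. The function name and the sample matrix are illustrative.

```python
def bandwidth(matrix):
    """Return U + L + 1 for a matrix given as a list of rows."""
    upper = lower = 0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value:
                upper = max(upper, j - i)  # distance above the diagonal
                lower = max(lower, i - j)  # distance below the diagonal
    return upper + lower + 1


# Hypothetical 4 x 4 binary matrix: U = 1, L = 2
sample = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]
print(bandwidth(sample))
```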

Page 11: On the Anonymization of Sparse High-Dimensional Data

Reverse Cuthill-McKee (RCM)

Heuristic for bandwidth minimization

Solves the corresponding graph labeling problem

Permutes rows and columns

Complexity: O(N · D · log D)

N = matrix rows (# transactions)

D = maximum degree of any vertex
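
The graph labeling step can be sketched in plain Python as a breadth-first traversal that visits neighbours in order of increasing degree and then reverses the result. This is a textbook version of the heuristic on a generic undirected graph; how the presentation derives the graph from the transaction matrix is not shown in the slide text, and the function name and sample graph are illustrative.

```python
from collections import deque


def reverse_cuthill_mckee(adj):
    """`adj` maps each vertex to the set of its neighbours (undirected graph).
    Returns the vertices in reverse Cuthill-McKee order."""
    def by_degree(v):
        return (len(adj[v]), v)  # ties broken by vertex id for determinism

    visited, order = set(), []
    # Start every connected component from an unvisited minimum-degree vertex
    for start in sorted(adj, key=by_degree):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v] - visited, key=by_degree):
                visited.add(w)
                queue.append(w)
    return order[::-1]


# Hypothetical path graph with scrambled labels: 0-3, 3-1, 1-4, 4-2
graph = {0: {3}, 1: {3, 4}, 2: {4}, 3: {0, 1}, 4: {1, 2}}
order = reverse_cuthill_mckee(graph)
position = {v: i for i, v in enumerate(order)}
print(order)
print(max(abs(position[u] - position[v]) for u in graph for v in graph[u]))
```

Relabeling the vertices by their position in `order` places every edge next to the diagonal in this example, whereas the original labels put edge 0-3 three positions away from it.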

Page 12: On the Anonymization of Sparse High-Dimensional Data

Group Formation

Correlation-aware Anonymization of High-Dimensional Data (CAHD)

Use the order given by RCM

Consecutive transactions highly correlated

O(pN) complexity
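
The idea of grouping along the RCM order can be illustrated with a deliberately simplified greedy procedure. It is not the CAHD heuristic itself, whose details are not given in the slide text: it assumes a single sensitive item, pairs each sensitive transaction with its p - 1 nearest ungrouped non-sensitive neighbours in the order, reads the privacy bound as "at most 1/p", and does not reach the O(pN) cost stated above. All names are hypothetical.

```python
def form_groups(order, sensitive, p):
    """order: transaction ids in band-matrix (RCM) order.
    sensitive: ids of transactions containing the sensitive item.
    Returns a list of groups (lists of ids); leftovers form the last group."""
    position = {t: i for i, t in enumerate(order)}
    free = set(order)
    groups = []
    for t in order:
        if t not in sensitive or t not in free:
            continue
        # Nearest ungrouped, non-sensitive transactions in the given order
        companions = sorted(
            (u for u in free if u not in sensitive),
            key=lambda u: (abs(position[u] - position[t]), position[u]),
        )[: p - 1]
        if len(companions) < p - 1:
            break  # too few companions left; caller must check the last group
        group = [t] + companions
        free.difference_update(group)
        groups.append(group)
    if free:
        groups.append(sorted(free, key=position.get))
    return groups


# Hypothetical input: ten transactions already in RCM order, two are sensitive
groups = form_groups(list(range(10)), sensitive={2, 7}, p=3)
print(groups)
```

Because neighbours in the order share many items, each group's members resemble one another, which is what lets the published groups retain correlations. A complete implementation would also verify that the final leftover group meets the privacy condition.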

Page 13: On the Anonymization of Sparse High-Dimensional Data

Group Formation

Page 14: On the Anonymization of Sparse High-Dimensional Data

Experimental Evaluation

Page 15: On the Anonymization of Sparse High-Dimensional Data

RCM Visualization

Page 16: On the Anonymization of Sparse High-Dimensional Data

Experimental Setting

BMS dataset

Compare with hybrid PermMondrian (PM)

Combines Mondrian with Anatomy

Query Workload

Reconstruction Error

Page 17: On the Anonymization of Sparse High-Dimensional Data

Reconstruction Error vs. p

Page 18: On the Anonymization of Sparse High-Dimensional Data

Execution Time

Page 19: On the Anonymization of Sparse High-Dimensional Data

Conclusions

Anonymizing transaction data

High dimensionality

Preserving correlation

Future work

Different encodings for data representation

Enhance correlation among consecutive rows