On the Anonymization of Sparse High-Dimensional Data

Gabriel Ghinita 1, Yufei Tao 2, Panos Kalnis 1

1 National University of Singapore, {ghinitag,kalnis}@comp.nus.edu.sg

2 Chinese University of Hong Kong, [email protected]

Post on 20-Dec-2015



2

Publishing Transaction Data

Retail chain-owned shopping cart data

Infer consumer spending patterns

Correlations among purchased items

e.g., 90% of cereal buyers also buy milk
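A correlation such as the cereal/milk example can be read as the confidence of an association rule. The snippet below is a minimal sketch of that calculation; the function name `confidence` and the toy carts are illustrative assumptions, not data from the talk.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0  # rule never fires, so report zero confidence
    return sum(consequent in t for t in with_antecedent) / len(with_antecedent)


# Hypothetical shopping carts, one set of items per transaction.
carts = [
    {"cereal", "milk"},
    {"cereal", "milk", "bread"},
    {"cereal"},
    {"bread", "milk"},
]
print(confidence(carts, "cereal", "milk"))
```

An anonymization method preserves this correlation to the extent that the same ratio can still be estimated from the published data.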

What about privacy?

3

Privacy Threat

Quasi-identifying

Items

Sensitive

Items

4

Privacy Paradigm

ℓ-diversity

Prevent association between quasi-identifier and sensitive attributes

Create groups of transactions

Frequency of a sensitive attribute (SA) value in a group < 1/p

Objective

Enforce privacy

Preserve correlations among items

Challenge: high data dimensionality
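The group condition above can be checked mechanically. The following is a minimal sketch, assuming each transaction is a Python set of items and that the bound is strict, as written on the slide; the name `satisfies_privacy` and the sample items are hypothetical.

```python
from collections import Counter


def satisfies_privacy(group, sensitive_items, p):
    """True if every sensitive item occurs in fewer than len(group)/p transactions."""
    counts = Counter(
        item for t in group for item in t if item in sensitive_items
    )
    # count / len(group) < 1/p, rewritten without division
    return all(c * p < len(group) for c in counts.values())


# Hypothetical group of three transactions with one sensitive item.
group = [{"wine", "pregnancy test"}, {"wine", "milk"}, {"milk", "cereal"}]
print(satisfies_privacy(group, {"pregnancy test"}, p=2))
```

A group without any sensitive item passes trivially, since there is nothing to associate.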

5

Data Re-organization

Band Matrix Organization

PRESERVES CORRELATIONS!

6

Published Data

Summary of Sensitive Items

7

Contributions

Novel data representation

Preserves correlation among items

Efficient heuristic for group formation

Time linear in the data size

Supports multiple sensitive items

State-of-the-art: Mondrian [FWR06]

Generalization-based

Data-space partitioning similar to k-d-trees

Split recursively as long as the privacy condition still holds

Constrained global recoding

[Figure: Mondrian example with k = 2; axes: Age (ticks 20, 40, 60) and Weight (ticks 40, 60, 80, 100)]

[FWR06] K. LeFevre et al. Mondrian Multidimensional k-Anonymity. Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006

GENERALIZATION + HIGH DIMENSIONALITY = UNACCEPTABLE INFORMATION LOSS
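To make the k-d-tree analogy concrete, here is a rough sketch of a Mondrian-style partitioner. It assumes numeric quasi-identifiers, a strict median split, and a "widest range first" choice of dimension; the names `mondrian` and `generalize` and the sample records are illustrative, and this is not claimed to be the exact algorithm of [FWR06].

```python
def mondrian(records, k):
    """Recursively split records (tuples of numeric QID values) while both halves keep >= k records."""
    def span(d):
        values = [r[d] for r in records]
        return max(values) - min(values)

    # Try dimensions from widest to narrowest range.
    for d in sorted(range(len(records[0])), key=span, reverse=True):
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in records if r[d] < median]
        right = [r for r in records if r[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [records]  # no allowable split: this partition is final


def generalize(partition):
    """Replace each QID column by the (min, max) range observed in the partition."""
    dims = range(len(partition[0]))
    return tuple(
        (min(r[d] for r in partition), max(r[d] for r in partition)) for d in dims
    )


# Hypothetical (Age, Weight) records.
people = [(25, 50), (30, 60), (45, 70), (50, 90), (55, 80), (35, 55)]
parts = mondrian(people, k=2)
print([generalize(part) for part in parts])
```

Every record in a partition is published with the same ranges, so each added dimension widens the published boxes; with thousands of items as dimensions, the ranges carry very little information.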

State-of-the-art: Anatomy [XT06]

Permutation-based method

Discloses exact QID values

Original table:

Age  ZipCode  Disease
42   52000    Ulcer
47   43000    Pneumonia
51   32000    Flu
55   27000    Gastritis
62   41000    Dyspepsia
67   55000    Dyspepsia

"Anatomized" table:

Age  ZipCode
42   52000
47   43000
51   32000
62   41000
55   27000
67   55000

Disease
Ulcer(1)
Pneumonia(1)
Flu(1)
Dyspepsia(1)
Gastritis(1)
Dyspepsia(1)

|G|! permutations

RANDOM GROUP FORMATION

DOES NOT PRESERVE CORRELATIONS

[XT06] X. Xiao and Y. Tao. Anatomy: Simple and Effective Privacy Preservation. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006
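The split into an exact QID table and a per-group disease summary can be sketched as follows. The records are those of the slide's original table; the grouping into two groups of three, the function name `anatomize`, and the group identifiers are assumptions made for illustration.

```python
from collections import Counter


def anatomize(records, groups):
    """Return (qid_table, sensitive_table) for records given as (age, zipcode, disease)."""
    qid_table, sensitive_table = [], []
    for gid, members in enumerate(groups, start=1):
        for i in members:
            age, zipcode, _ = records[i]
            qid_table.append((age, zipcode, gid))  # exact QID values are kept
        counts = Counter(records[i][2] for i in members)
        for disease, n in sorted(counts.items()):
            sensitive_table.append((gid, disease, n))  # only per-group counts
    return qid_table, sensitive_table


records = [
    (42, 52000, "Ulcer"),
    (47, 43000, "Pneumonia"),
    (51, 32000, "Flu"),
    (55, 27000, "Gastritis"),
    (62, 41000, "Dyspepsia"),
    (67, 55000, "Dyspepsia"),
]
# Assumed grouping: rows 0-2 form group 1, rows 3-5 form group 2.
qit, st = anatomize(records, [[0, 1, 2], [3, 4, 5]])
print(qit)
print(st)
```

Within a group, any assignment of diseases to rows is consistent with the published tables, which is where the |G|! permutations come from.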

10

Band Matrix Representation

Bandwidth = U + L + 1 (U, L: upper and lower bandwidth)

Minimizing bandwidth is NP-hard
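The bandwidth formula can be evaluated directly on a 0/1 matrix. This is a small sketch that takes U as the largest distance of a nonzero entry above the main diagonal and L as the largest distance below it; the function name `bandwidth` and both sample matrices are made up for illustration.

```python
def bandwidth(matrix):
    """Return U + L + 1 for a matrix given as a list of rows."""
    upper = lower = 0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value:
                upper = max(upper, j - i)  # distance above the main diagonal
                lower = max(lower, i - j)  # distance below the main diagonal
    return upper + lower + 1


banded = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
scattered = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
]
print(bandwidth(banded))     # 3
print(bandwidth(scattered))  # 7
```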

11

Reverse Cuthill-McKee (RCM)

Heuristic bandwidth minimization

Solves the corresponding graph labeling problem

Permutes rows and columns

Complexity: N * D * log D

N = matrix rows (# transactions)

D = maximum degree of any vertex
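The graph labeling step can be illustrated with a plain-Python version of RCM on an undirected graph. This is a simplified sketch: it starts each component from a minimum-degree vertex instead of searching for a pseudo-peripheral one, and it works on an adjacency dictionary rather than on the transaction matrix itself. The function name and the sample graph are assumptions.

```python
from collections import deque


def reverse_cuthill_mckee(adj):
    """adj maps each vertex to the set of its neighbours; returns the RCM vertex order."""
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    visited, order = set(), []
    # Start each connected component from a lowest-degree vertex.
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Visit unseen neighbours in increasing order of degree.
            for u in sorted(adj[v] - visited, key=lambda u: (degree[u], u)):
                visited.add(u)
                queue.append(u)
    return order[::-1]  # reversing the Cuthill-McKee order gives RCM


# A path graph whose vertex labels are scrambled: 0 - 3 - 1 - 4 - 2.
graph = {0: {3}, 3: {0, 1}, 1: {3, 4}, 4: {1, 2}, 2: {4}}
rcm_order = reverse_cuthill_mckee(graph)
print(rcm_order)
```

Sorting each vertex's neighbours by degree accounts for the D log D factor in the stated complexity. In practice, SciPy's `scipy.sparse.csgraph.reverse_cuthill_mckee` offers a ready-made implementation for sparse matrices.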

12

Group Formation

Correlation-aware Anonymization of High-Dimensional Data (CAHD)

Use the order given by RCM

Consecutive transactions highly correlated

O(pN) complexity
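The slides do not spell out the heuristic, so the following is only a much-simplified stand-in for the idea of grouping consecutive transactions in RCM order, not the CAHD algorithm itself. It closes a group as soon as the group contains a sensitive item and the strict 1/p bound holds, then folds any leftover rows backwards. The names `bound_holds` and `greedy_groups` and the sample data are hypothetical.

```python
from collections import Counter


def bound_holds(group, sensitive_items, p):
    """True if each sensitive item appears in fewer than len(group)/p transactions."""
    counts = Counter(i for t in group for i in t & sensitive_items)
    return all(c * p < len(group) for c in counts.values())


def greedy_groups(ordered, sensitive_items, p):
    """Group consecutive transactions (sets of items) so that every group meets the bound."""
    groups, current, counts = [], [], Counter()
    for t in ordered:
        current.append(t)
        counts.update(t & sensitive_items)
        # Close the group once it holds a sensitive item and meets the bound.
        if counts and all(c * p < len(current) for c in counts.values()):
            groups.append(current)
            current, counts = [], Counter()
    # Fold leftover rows into preceding groups until the bound holds again.
    while current and groups and not bound_holds(current, sensitive_items, p):
        current = groups.pop() + current
    if current:
        if not bound_holds(current, sensitive_items, p):
            raise ValueError("privacy bound cannot be met for this input")
        groups.append(current)
    return groups


# Transactions assumed to be already in RCM order; "x" and "y" are sensitive.
ordered = [{"a", "x"}, {"a", "b"}, {"b"}, {"b", "c", "y"}, {"c"}, {"c", "d"}]
groups = greedy_groups(ordered, {"x", "y"}, p=2)
print([len(g) for g in groups])
```

Because neighbours in the band-matrix order share many items, each group mixes a sensitive transaction with transactions that look similar on the non-sensitive items.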

13

Group Formation

Experimental Evaluation

15

RCM Visualization

16

Experimental Setting

BMS dataset

Compare with hybrid PermMondrian (PM)

Combines Mondrian with Anatomy

Query Workload

Reconstruction Error
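The slide does not define the error metric. One common way to compare an actual distribution with the one estimated from anonymized data is the KL divergence, and the sketch below assumes that choice; the function name, the cell labels, and the numbers are purely illustrative.

```python
import math


def kl_divergence(actual, estimated):
    """Sum of actual[c] * log(actual[c] / estimated[c]) over cells with actual[c] > 0."""
    total = 0.0
    for cell, a in actual.items():
        if a > 0:
            # Assumes estimated[cell] > 0 wherever actual[cell] > 0.
            total += a * math.log(a / estimated[cell])
    return total


# Hypothetical distributions of a sensitive item over two query cells.
actual = {"cell_1": 0.5, "cell_2": 0.5}
estimated = {"cell_1": 0.9, "cell_2": 0.1}
print(kl_divergence(actual, estimated))
```

The value is zero when the estimate matches the actual distribution and grows as the two diverge.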

17

Reconstruction Error vs p

18

Execution Time

19

Conclusions

Anonymizing transaction data

High dimensionality

Preserving correlation

Future work

Different encodings for data representation

Enhance correlation among consecutive rows