Privacy-preserving Anonymization of Set Value Data

Manolis Terrovitis, Nikos Mamoulis
University of Hong Kong

Panos Kalnis
National University of Singapore
www.comp.nus.edu.sg/~kalnis

Motivation

Attacker can see up to m items

Any m items

No distinction between sensitive and non-sensitive items

[Figure: Helen's purchase: Beer, 0% Milk, Pregnancy test]

Motivation (cont.)

Database

Helen: Beer, 0% Milk, Pregnancy test
John: Cola, Cheese
Tom: 2% Milk, Coffee
…
Mary: Wine, Beer, Full-fat Milk

Published

t1: Beer, 0% Milk, Pregnancy test
t2: Cola, Cheese
t3: 2% Milk, Coffee
…
tn: Wine, Beer, Full-fat Milk

Attacker: Find all transactions that contain Beer & 0% Milk

With every milk variant generalized to Milk:

t1: Beer, Milk, Pregnancy test
t2: Cola, Cheese
t3: Milk, Coffee
…
tn: Wine, Beer, Milk
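
The attack on this slide can be sketched in a few lines of Python. This is only an illustration: the elided transactions ("…") are not filled in, so the toy database holds just the four listed transactions, and the helper name `matching` is hypothetical.

```python
def matching(db, known_items):
    """Return the ids of transactions that contain every item the attacker knows."""
    known = set(known_items)
    return [tid for tid, items in db.items() if known <= items]

# Only the four transactions shown on the slide; the elided ones are left out.
published = {
    "t1": {"Beer", "0% Milk", "Pregnancy test"},
    "t2": {"Cola", "Cheese"},
    "t3": {"2% Milk", "Coffee"},
    "tn": {"Wine", "Beer", "Full-fat Milk"},
}

# Generalize every milk variant to the single item "Milk".
MILK = {"0% Milk", "2% Milk", "Full-fat Milk"}
generalized = {
    tid: {"Milk" if item in MILK else item for item in items}
    for tid, items in published.items()
}

exact = matching(published, ["Beer", "0% Milk"])   # attacker on the raw data
coarse = matching(generalized, ["Beer", "Milk"])   # attacker on generalized data
print(exact, coarse)
```

On the raw data the two known items single out t1; after generalization the same knowledge matches both t1 and tn.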

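$k^m$-anonymity

Set of items: $I = \{o_1, o_2, \ldots, o_{|I|}\}$

Transaction: $t \subseteq I$

Database: $D = \{t_1, t_2, \ldots, t_{|D|}\}$

Query terms: $qs \subseteq I$, $|qs| \le m$

$res = \{\, t \mid t \in D \wedge qs \subseteq t \,\}$

$k^m$-anonymity: for every $qs$ with $|qs| \le m$, $|res| = 0 \;\vee\; |res| \ge k$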
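
As a minimal sketch of the definition above, the following Python checks the condition by brute force: it counts every itemset of size at most m that occurs in some transaction and reports those with fewer than k matches. Itemsets that never occur have an empty result and are allowed, so they need no counting. The function names and the toy database are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def violations(db, k, m):
    """Itemsets of size <= m that occur in at least one but fewer than k transactions."""
    counts = Counter()
    for t in db:
        items = sorted(set(t))
        for size in range(1, min(m, len(items)) + 1):
            counts.update(combinations(items, size))
    return {qs: c for qs, c in counts.items() if c < k}

def is_km_anonymous(db, k, m):
    """True if every query of up to m items matches 0 or at least k transactions."""
    return not violations(db, k, m)

# Hypothetical toy database over items a1, a2, b1, b2.
toy = [{"a1", "b1"}, {"a1", "b2"}, {"a2", "b1"}, {"a2", "b2"}]
print(is_km_anonymous(toy, k=2, m=1))  # every single item occurs twice
print(is_km_anonymous(toy, k=2, m=2))  # every pair occurs only once
```

The brute-force enumeration grows quickly with m and transaction length, which is one reason a dedicated counting structure is useful.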

Related Work: K-Anonymity [Swe02]

[Swe02] L. Sweeney. k-Anonymity: A Model for Protecting Privacy. Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557-570, 2002.

Age  ZipCode  Disease
42   25000    Flu
46   35000    AIDS
50   20000    Cancer
54   40000    Gastritis
48   50000    Dyspepsia
56   55000    Bronchitis

(a) Microdata

Quasi-identifier: Age, ZipCode

Age    ZipCode      Disease
42-46  25000-35000  Flu
42-46  25000-35000  AIDS
50-54  20000-40000  Cancer
50-54  20000-40000  Gastritis
48-56  50000-55000  Dyspepsia
48-56  50000-55000  Bronchitis

(b) 2-anonymous microdata

NOT suitable for high-dimensional data
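
The k-anonymity property of the two tables can be verified with a short check: group the rows by their quasi-identifier values and require every group to hold at least k rows. This sketch assumes that Age and ZipCode form the quasi-identifier and Disease is the sensitive attribute; the function name is hypothetical.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """True if every combination of quasi-identifier values occurs in >= k rows."""
    groups = Counter(tuple(r[a] for a in quasi_identifier) for r in rows)
    return all(count >= k for count in groups.values())

QI = ("Age", "ZipCode")  # assumed quasi-identifier

microdata = [
    {"Age": "42", "ZipCode": "25000", "Disease": "Flu"},
    {"Age": "46", "ZipCode": "35000", "Disease": "AIDS"},
    {"Age": "50", "ZipCode": "20000", "Disease": "Cancer"},
    {"Age": "54", "ZipCode": "40000", "Disease": "Gastritis"},
    {"Age": "48", "ZipCode": "50000", "Disease": "Dyspepsia"},
    {"Age": "56", "ZipCode": "55000", "Disease": "Bronchitis"},
]

generalized = [
    {"Age": "42-46", "ZipCode": "25000-35000", "Disease": "Flu"},
    {"Age": "42-46", "ZipCode": "25000-35000", "Disease": "AIDS"},
    {"Age": "50-54", "ZipCode": "20000-40000", "Disease": "Cancer"},
    {"Age": "50-54", "ZipCode": "20000-40000", "Disease": "Gastritis"},
    {"Age": "48-56", "ZipCode": "50000-55000", "Disease": "Dyspepsia"},
    {"Age": "48-56", "ZipCode": "50000-55000", "Disease": "Bronchitis"},
]

print(is_k_anonymous(microdata, QI, 2), is_k_anonymous(generalized, QI, 2))
```

With transactional data every item is a potential quasi-identifier attribute, so the number of grouping columns equals the size of the item domain, which is why this model does not carry over directly.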

Related Work: L-diversity in Transactions

[GTK08] G. Ghinita, Y. Tao, P. Kalnis. On the Anonymization of Sparse High-Dimensional Data. ICDE, 2008.

Requires knowledge of (non)-sensitive attributes

Our Approach: Employs Generalization

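[Figure: generalization hierarchy]

Example generalization: $\{a_1, a_2\} \rightarrow A$

Information loss:

$$NCP(p) = \begin{cases} 0, & p \text{ is a leaf node} \\ \dfrac{|u_p|}{|I|}, & \text{otherwise} \end{cases}$$

where $u_p$ is the hierarchy node to which item $p$ is generalized and $|u_p|$ is the number of leaf items under it.

k=2, m=2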
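
A small sketch of the information-loss measure follows. It assumes that the "otherwise" case is the fraction of all leaf items covered by the generalized node, and it uses a hypothetical three-level hierarchy (ALL over A and B, A over a1 and a2, B over b1 and b2) that is not spelled out on the slide.

```python
# Assumed toy hierarchy: ALL -> A, B; A -> a1, a2; B -> b1, b2
CHILDREN = {"ALL": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}

def leaves(node):
    """Set of leaf items found under `node` (the node itself if it is a leaf)."""
    if node not in CHILDREN:
        return {node}
    found = set()
    for child in CHILDREN[node]:
        found |= leaves(child)
    return found

def ncp(node, root="ALL"):
    """0 for a leaf; otherwise the share of all leaf items that `node` covers."""
    if node not in CHILDREN:
        return 0.0
    return len(leaves(node)) / len(leaves(root))

print(ncp("a1"), ncp("A"), ncp("ALL"))
```

Under these assumptions an ungeneralized item costs nothing, generalizing a1 to A costs one half, and generalizing to the root loses all information about the item.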

Lattice of Generalizations

Count Tree

[Figure: count tree holding the support of itemsets over a1, a2, b1, b2 and their generalizations A, B]
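
One possible way to realize a count tree is a prefix tree over sorted itemsets, where each node stores how many transactions contain the itemset spelled by the path to it. The sketch below is a simplified illustration, not the authors' exact structure: it assumes a two-level hierarchy (a1, a2 under A; b1, b2 under B), expands each transaction with the generalizations of its items, and stores all itemsets of size up to m.

```python
from itertools import combinations

PARENT = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}  # assumed hierarchy

def expand(t):
    """Add the generalization of every item to the transaction."""
    return set(t) | {PARENT[i] for i in t if i in PARENT}

def build_count_tree(db, m):
    root = {"count": 0, "children": {}}
    for t in db:
        items = sorted(expand(t))
        for size in range(1, m + 1):
            for combo in combinations(items, size):
                # Skip itemsets that hold an item together with its own generalization.
                if any(PARENT.get(i) in combo for i in combo):
                    continue
                node = root
                for item in combo:
                    node = node["children"].setdefault(
                        item, {"count": 0, "children": {}})
                node["count"] += 1
    return root

def support(tree, itemset):
    """Look up the stored count of an itemset (0 if it never occurs)."""
    node = tree
    for item in sorted(itemset):
        node = node["children"].get(item)
        if node is None:
            return 0
    return node["count"]

toy = [{"a1", "b1"}, {"a1", "b2"}, {"a2", "b1"}, {"a2", "b2"}]
tree = build_count_tree(toy, m=2)
print(support(tree, {"a1", "b1"}), support(tree, {"a1", "B"}), support(tree, {"A", "B"}))
```

Because generalized items are stored alongside the originals, the support of a candidate generalization can be read from the tree without rescanning the database.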

Optimal Algorithm

“Direct” Anonymization

COUNT({a1,a2})=1

Solves each “problem” independently
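
To illustrate what solving one "problem" independently might look like, the sketch below takes a single rare itemset such as {a1, a2} and searches for the cheapest generalization of its items that reaches support k. It is a simplified stand-in, not the authors' algorithm: the hierarchy, the cost (sum of assumed NCP values of the chosen nodes), and all function names are assumptions.

```python
from itertools import product

# Assumed toy hierarchy: ALL -> A, B; A -> a1, a2; B -> b1, b2
PARENT = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "A": "ALL", "B": "ALL"}
LEAVES = ["a1", "a2", "b1", "b2"]

def chain(item):
    """The item followed by all of its ancestors, most specific first."""
    out = [item]
    while out[-1] in PARENT:
        out.append(PARENT[out[-1]])
    return out

def ncp(node):
    covered = sum(node in chain(leaf) for leaf in LEAVES)
    return 0.0 if covered == 1 else covered / len(LEAVES)

def support(db, nodes):
    """Transactions holding, for every node, at least one item under that node."""
    return sum(all(any(n in chain(i) for i in t) for n in nodes) for t in db)

def fix_one(db, itemset, k):
    """Cheapest generalization of `itemset` alone that reaches support >= k."""
    items = sorted(itemset)
    best = None
    for choice in product(*(chain(i) for i in items)):
        nodes = set(choice)
        # Skip choices where one chosen node is a strict ancestor of another.
        if any(a != b and a in chain(b) for a in nodes for b in nodes):
            continue
        cost = sum(ncp(n) for n in nodes)
        if support(db, nodes) >= k and (best is None or cost < best[0]):
            best = (cost, dict(zip(items, choice)))
    return best

# Hypothetical database in which COUNT({a1, a2}) = 1.
toy = [{"a1", "a2"}, {"a1", "b1"}, {"a2", "b1"}]
result = fix_one(toy, {"a1", "a2"}, k=2)
print(result)
```

Treating each rare itemset in isolation keeps every step simple, but fixes chosen for different itemsets still have to be merged into one global recoding.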

“Apriori-based” Anonymization

Construct the count-tree incrementally

Prune unnecessary branches
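
The level-wise idea can be sketched as follows: handle itemsets of size 1 first, then size 2, and so on up to m, always counting on the data as generalized so far, so that itemsets already merged by an earlier step never need to be counted. This is a simplified greedy illustration under stated assumptions, not the authors' algorithm: it recounts with a plain counter instead of a count tree, generalizes one step at a time, and uses a hypothetical toy hierarchy.

```python
from collections import Counter
from itertools import combinations

# Assumed toy hierarchy: ALL -> A, B; A -> a1, a2; B -> b1, b2
PARENT = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "A": "ALL", "B": "ALL"}
LEAVES = ["a1", "a2", "b1", "b2"]

def covers(node, leaf):
    """True if `node` is `leaf` itself or one of its ancestors."""
    while leaf is not None:
        if leaf == node:
            return True
        leaf = PARENT.get(leaf)
    return False

def ncp(node):
    covered = sum(covers(node, leaf) for leaf in LEAVES)
    return 0.0 if covered == 1 else covered / len(LEAVES)

def apply_cut(db, cut):
    """Replace every item by the node it is currently generalized to."""
    return [frozenset(cut[i] for i in t) for t in db]

def rare_itemsets(db, size, k):
    counts = Counter()
    for t in db:
        counts.update(combinations(sorted(t), size))
    return [s for s, c in counts.items() if c < k]

def apriori_anonymize(db, k, m):
    cut = {leaf: leaf for leaf in LEAVES}        # start with no generalization
    for size in range(1, m + 1):                 # grow the itemset size step by step
        while True:
            rare = rare_itemsets(apply_cut(db, cut), size, k)
            if not rare:
                break
            candidates = [PARENT[i] for i in rare[0] if i in PARENT]
            if not candidates:
                raise ValueError("fewer than k transactions; cannot anonymize")
            target = min(candidates, key=ncp)    # cheapest one-step generalization
            for leaf in LEAVES:                  # global recoding: move the whole subtree
                if covers(target, leaf):
                    cut[leaf] = target
    return cut

toy = [{"a1", "b1"}, {"a1", "b2"}, {"a2", "b1"}, {"a2", "b2"}]
cut = apriori_anonymize(toy, k=2, m=2)
print(cut)
```

Generalization can only raise the support of an itemset, so sizes that were already safe stay safe while larger sizes are processed.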

Small Datasets (2-15K, BMS-WebView2)

|I|=40..60, k=100, m=3

Small Datasets (BMS-WebView2)

|D|=10K, k=100, m=1..4

Apriori Anonymization for Large Datasets

[Figure: plot annotated with 500 sec, 10 sec, 100 sec]

|D|    |I|
515K   1657
59K    497
77K    3340

k=5, m=3

Points to Remember

Anonymization of Transactional Data
    Attacker knows m items
    Any m items can be the quasi-identifier

Global recoding method
    Optimal solution: too slow
    Apriori Anonymization: fast and low information loss

On-going work
    Local recoding (sort by Gray order and partition)
    Transactional data in streaming environments

Bibliography on LBS Privacy

http://anonym.comp.nus.edu.sg
