Dilys Thomas PODS 2006 1

Achieving Anonymity via Clustering

G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, A. Zhu

Talk outline

• k-Anonymity model

• Achieving Anonymity via Clustering

• r-Gather clustering

• Cellular clustering

• Future Work

Medical Records

Identifying attributes: SSN, Name. Sensitive attribute: Disease.

SSN  Name   DOB       Race   Zip code  Disease
614  Sara   03/04/76  Cauc   94305     Flu
615  Joan   07/11/80  Cauc   94307     Cold
629  Kelly  05/09/55  Cauc   94301     Diabetes
710  Mike   11/23/62  Afr-A  94305     Flu
840  Carl   11/23/62  Afr-A  94059     Arthritis
780  Joe    01/07/50  Hisp   94042     Heart problem
619  Rob    04/08/43  Hisp   94042     Arthritis

De-identified Medical Records

Sensitive attribute: Disease.

DOB       Race   Zip code  Disease
03/04/76  Cauc   94305     Flu
07/11/80  Cauc   94307     Cold
05/09/55  Cauc   94301     Diabetes
11/23/62  Afr-A  94305     Flu
11/23/62  Afr-A  94059     Arthritis
01/07/50  Hisp   94042     Heart problem
04/08/43  Hisp   94042     Arthritis

k-Anonymity model

Quasi-identifiers (DOB, Race, Zip code): approximate foreign keys that can uniquely identify you!

Sensitive attribute: Disease.

DOB       Race   Zip code  Disease
03/04/76  Cauc   94305     Flu
07/11/80  Cauc   94307     Cold
05/09/55  Cauc   94301     Diabetes
12/30/72  Afr-A  94305     Flu
11/23/62  Afr-A  94059     Arthritis
01/07/50  Hisp   94042     Heart problem
04/08/43  Hisp   94042     Arthritis

k-Anonymity Model [Swe00]

• Suppress some entries of quasi-identifiers so that each modified row becomes identical to at least k-1 other rows with respect to the quasi-identifiers

• Individual records are hidden in a crowd of size k
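The definition above can be checked mechanically. The following is a minimal sketch, not code from the talk: the function name `is_k_anonymous`, the dictionary layout, and the lowercase column keys are assumptions made for illustration. The rows are those of the talk's 2-anonymized table, with `*` marking a suppressed entry.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every row shares its quasi-identifier values with at least k-1 other rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(n >= k for n in counts.values())

# Rows of the 2-anonymized table; "*" is a suppressed entry.
table = [
    {"dob": "*", "race": "Cauc", "zip": "*", "disease": "Flu"},
    {"dob": "*", "race": "Cauc", "zip": "*", "disease": "Cold"},
    {"dob": "*", "race": "Cauc", "zip": "*", "disease": "Diabetes"},
    {"dob": "11/23/62", "race": "Afr-A", "zip": "*", "disease": "Flu"},
    {"dob": "11/23/62", "race": "Afr-A", "zip": "*", "disease": "Arthritis"},
    {"dob": "*", "race": "Hisp", "zip": "94042", "disease": "Heart problem"},
    {"dob": "*", "race": "Hisp", "zip": "94042", "disease": "Arthritis"},
]
QI = ("dob", "race", "zip")  # the sensitive column is deliberately excluded

print(is_k_anonymous(table, QI, 2))  # True: groups of size 3, 2 and 2
print(is_k_anonymous(table, QI, 3))  # False: two groups have only 2 rows
```

Only the quasi-identifier columns take part in the comparison; the sensitive attribute is published unchanged.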

2-Anonymized Table

DOB       Race   Zip code  Disease
*         Cauc   *         Flu
*         Cauc   *         Cold
*         Cauc   *         Diabetes
11/23/62  Afr-A  *         Flu
11/23/62  Afr-A  *         Arthritis
*         Hisp   94042     Heart problem
*         Hisp   94042     Arthritis

k-Anonymity Optimization

• Minimize the number of generalizations/suppressions needed to achieve k-Anonymity

• Finding the minimum number of suppressions/generalizations is NP-hard [MW04]

• O(k) approximation for k-anonymity [AFK+05]

• Ω(k) lower bound on the approximation ratio under the graph assumption
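To make the objective concrete, here is a small sketch that counts suppressions for a given grouping of rows. It only evaluates one fixed partition and does not search for the minimum, which is the NP-hard part. The helper name `suppress_groups` and the choice of grouping are assumptions for illustration; the rows are the quasi-identifier columns of the de-identified medical records table.

```python
def suppress_groups(rows, groups):
    """Within each group, star out every column on which the group's rows disagree.

    Returns the anonymized rows and the number of suppressed entries.
    """
    out = [list(r) for r in rows]
    cost = 0
    for group in groups:
        for col in range(len(rows[0])):
            if len({rows[i][col] for i in group}) > 1:  # rows disagree on this column
                for i in group:
                    out[i][col] = "*"
                    cost += 1
    return out, cost

# (DOB, Race, Zip code) for the seven de-identified records.
rows = [
    ("03/04/76", "Cauc", "94305"),
    ("07/11/80", "Cauc", "94307"),
    ("05/09/55", "Cauc", "94301"),
    ("11/23/62", "Afr-A", "94305"),
    ("11/23/62", "Afr-A", "94059"),
    ("01/07/50", "Hisp", "94042"),
    ("04/08/43", "Hisp", "94042"),
]
groups = [[0, 1, 2], [3, 4], [5, 6]]  # assumed partition, every group has >= 2 rows

anonymized, cost = suppress_groups(rows, groups)
print(cost)  # 10 suppressed entries
```

With this partition the output matches the starred entries of the 2-anonymized table: six in the first group and two in each of the other groups.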

Original Table

Name    Age  Salary
Amy     25   50
Brian   27   60
Carol   29   100
David   35   110
Evelyn  39   120

2-Anonymity with Suppression

Name    Age  Salary
Amy     *    *
Brian   *    *
Carol   *    *
David   *    *
Evelyn  *    *

All attributes suppressed

2-Anonymity with Generalization

Name    Age    Salary
Amy     20-30  50-100
Brian   20-30  50-100
Carol   20-30  50-100
David   30-40  100-150
Evelyn  30-40  100-150

Generalization allows pre-specified ranges

2-Anonymity with Clustering

Name    Age      Salary
Amy     [25-29]  [50-100]
Brian   [25-29]  [50-100]
Carol   [25-29]  [50-100]
David   [35-39]  [110-120]
Evelyn  [35-39]  [110-120]

Cluster centers published:

Cluster            Age center         Salary center
Amy, Brian, Carol  27 = (25+27+29)/3  70 = (50+60+100)/3
David, Evelyn      37 = (35+39)/2     115 = (110+120)/2
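The arithmetic above can be reproduced with a few lines of Python. This is an illustrative sketch only: the `people` dictionary, the `summarize` helper, and the fixed grouping are assumptions, and the clusters are taken as given rather than computed.

```python
# Original (Age, Salary) values from the talk's example table.
people = {
    "Amy": (25, 50),
    "Brian": (27, 60),
    "Carol": (29, 100),
    "David": (35, 110),
    "Evelyn": (39, 120),
}
clusters = [["Amy", "Brian", "Carol"], ["David", "Evelyn"]]  # grouping taken as given

def summarize(cluster):
    """Per-cluster summary: size, center (attribute means) and attribute ranges."""
    ages = [people[name][0] for name in cluster]
    salaries = [people[name][1] for name in cluster]
    return {
        "size": len(cluster),
        "center": (sum(ages) / len(ages), sum(salaries) / len(salaries)),
        "age_range": (min(ages), max(ages)),
        "salary_range": (min(salaries), max(salaries)),
    }

for cluster in clusters:
    print(summarize(cluster))
# centers: (27.0, 70.0) for the first cluster, (37.0, 115.0) for the second
```

Unlike generalization, the published ranges are derived from the data in each cluster instead of being fixed in advance.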

Advantages of Clustering

• Clustering reduces the amount of distortion introduced as compared to suppressions / generalizations

• Clustering allows constant factor approximation algorithms

Quasi-Identifiers form a Metric Space

• Convert quasi-identifiers into points in a metric space

• Distance function, D, on points
– D(X,X) = 0 (Reflexive)
– D(X,Y) = D(Y,X) (Symmetric)
– D(X,Z) <= D(X,Y) + D(Y,Z) (Triangle Inequality)

Metric Space

• Converting (gender, zip code, DOB) into points in a metric space is not easy.

• Define a distance function on each attribute.

• E.g. on Zip code:
– D(Zip1, Zip2) = physical distance between locations Zip1 and Zip2.

• Weight the attributes; the weighted sum of the attribute distances gives a metric.
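A minimal sketch of such a weighted distance follows. Everything specific in it is an assumption made for illustration: the coordinates in `ZIP_COORDS_KM` are made up, birth year stands in for DOB, and the weights are arbitrary. The point is only that a non-negative weighted sum of per-attribute metrics stays symmetric and keeps the triangle inequality.

```python
import math

# Hypothetical planar coordinates (km) per zip code; not real geography.
ZIP_COORDS_KM = {"94305": (0.0, 0.0), "94307": (3.0, 4.0), "94042": (6.0, 8.0)}

def gender_distance(a, b):
    return 0.0 if a == b else 1.0  # discrete metric

def zip_distance(a, b):
    (x1, y1), (x2, y2) = ZIP_COORDS_KM[a], ZIP_COORDS_KM[b]
    return math.hypot(x1 - x2, y1 - y2)  # physical distance between locations

def birth_year_distance(a, b):
    return float(abs(a - b))

# (per-attribute distance, weight) in record order; the weights are arbitrary.
ATTRIBUTES = [(gender_distance, 10.0), (zip_distance, 1.0), (birth_year_distance, 0.5)]

def distance(x, y):
    """Weighted sum of the attribute distances between two records."""
    return sum(w * d(a, b) for (d, w), a, b in zip(ATTRIBUTES, x, y))

p = ("F", "94305", 1976)
q = ("F", "94307", 1980)
s = ("M", "94042", 1950)

print(distance(p, q))  # 10*0 + 1*5 + 0.5*4 = 7.0
print(distance(p, s) <= distance(p, q) + distance(q, s))  # True: 33.0 <= 7.0 + 30.0
```

Choosing the weights is a modeling decision: a larger weight makes disagreement on that attribute more costly when clusters are formed.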

Clustering for Anonymity

• Cluster quasi-identifiers so that each cluster has at least r members for anonymity.

• Publish the cluster centers, each with its number of points and radius

• Tight clusters ⇒ usefulness of data for mining

• Large number of points per cluster ⇒ anonymity

Quasi-identifiers: Metric Space

• Assume further that the distance metric has already been defined on the quasi-identifiers

r-Gather Clustering

10 points, radius 5

20 points, radius 10

50 points, radius 20

• Minimize the maximum radius: 20
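The r-gather objective can be evaluated for a given clustering as follows. This is only a sketch of the cost function under assumed inputs (2-D points, Euclidean distance, and the made-up coordinates below); it does not implement the approximation algorithm.

```python
import math

def r_gather_cost(clusters, r):
    """Maximum cluster radius, after checking that every cluster has at least r points.

    clusters: list of (center, members), where each point is an (x, y) pair.
    """
    worst = 0.0
    for center, members in clusters:
        if len(members) < r:
            raise ValueError("cluster has fewer than r points")
        radius = max(math.hypot(px - center[0], py - center[1]) for px, py in members)
        worst = max(worst, radius)
    return worst

# Hypothetical clustering of six points into two clusters.
clusters = [
    ((0.0, 0.0), [(0.0, 0.0), (3.0, 4.0), (0.0, 2.0)]),      # radius 5
    ((20.0, 0.0), [(20.0, 0.0), (20.0, 1.0), (14.0, 8.0)]),  # radius 10
]
print(r_gather_cost(clusters, r=3))  # 10.0
```

Only the worst cluster counts, so a single loose cluster determines the cost regardless of how many points it holds.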

Results

• 2-approximation for minimizing the maximum radius under the cluster size constraint

• Matching lower bound of 2 for maximum radius minimization

r-Gather Clustering

[Figure: clusters with distances labeled 2d]

Lower Bound: Reduction from 3-SAT

[Figure: points X1T, X1F, X2T, X2F for variables X1 and X2, two groups of r-2 points, and a point C1 for a clause C1 over X1 and X2]

• r-gather with radius 1 iff the formula is satisfiable; else radius ≥ 2

Cellular Clustering

10 points, radius 5

20 points, radius 10

50 points, radius 20

Cellular Clustering Metric

10 points, radius 5

20 points, radius 10

50 points, radius 20

Cellular Clustering Metric: 10*5 + 20*10 + 50*20

= 50 + 200 + 1000 = 1250
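The same computation as a short sketch, with each cluster reduced to a (number of points, radius) pair. The function names are illustrative, not from the talk; the maximum radius is included for contrast with the r-gather objective.

```python
def cellular_cost(clusters):
    """Sum over clusters of (number of points) * (radius)."""
    return sum(n * radius for n, radius in clusters)

def max_radius(clusters):
    """The r-gather objective on the same clusters, for comparison."""
    return max(radius for _, radius in clusters)

clusters = [(10, 5), (20, 10), (50, 20)]  # (points, radius) from the slide
print(cellular_cost(clusters))  # 1250
print(max_radius(clusters))     # 20
```

The cellular metric charges every point for the radius of its own cluster, so small tight clusters are rewarded even when one large cluster is unavoidable.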

Cellular Clustering

• Primal-dual 4-approximation algorithm for cellular clustering

• Constant factor approximation with a minimum cluster size
– Each cluster has at least r points

Cellular Clustering: Linear Program

Minimize $\sum_c \left( \sum_i x_{ic} d_c + f_c y_c \right)$ (sum of cellular cost and facility cost)

Subject to:

$\sum_c x_{ic} \ge 1$ (each point belongs to a cluster)

$x_{ic} \le y_c$ (a cluster must be opened for a point to belong to it)

$0 \le x_{ic} \le 1$ (points belong to clusters positively)

$0 \le y_c \le 1$ (clusters are opened positively)
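To make the roles of the variables concrete, the sketch below evaluates the objective and checks the constraints for a candidate solution. It does not solve the linear program. The instance (three points, two candidate clusters, and the values of d and f) is hypothetical, and the function names are chosen for illustration.

```python
def lp_objective(x, y, d, f):
    """Sum over clusters c of (sum_i x[i][c] * d[c] + f[c] * y[c])."""
    total = 0.0
    for c in range(len(y)):
        assigned = sum(row[c] for row in x)  # (fractional) number of points in c
        total += assigned * d[c] + f[c] * y[c]
    return total

def lp_feasible(x, y, tol=1e-9):
    """Check: sum_c x_ic >= 1, 0 <= x_ic <= y_c, and 0 <= y_c <= 1."""
    for row in x:
        if sum(row) < 1 - tol:
            return False  # point not fully assigned
        for x_ic, y_c in zip(row, y):
            if x_ic < -tol or x_ic > y_c + tol:
                return False  # negative, or assigned to an unopened cluster
    return all(-tol <= y_c <= 1 + tol for y_c in y)

# Hypothetical instance: 3 points, 2 candidate clusters.
d = [2.0, 5.0]  # radius of each candidate cluster
f = [1.0, 3.0]  # facility cost of each candidate cluster

y = [1.0, 0.0]                      # open only cluster 0
x = [[1.0, 0.0] for _ in range(3)]  # every point assigned to cluster 0
print(lp_feasible(x, y), lp_objective(x, y, d, f))  # True 7.0
```

An integral solution such as this one corresponds to an actual clustering; the linear program also admits fractional values, which is what the primal-dual analysis works with.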

Dual Program

• Maximize $\sum_i \alpha_i$

• Subject to:

$\sum_i \beta_{ic} \le f_c$ (1)

$\alpha_i - \beta_{ic} \le d_c$ (2)

$\alpha_i \ge 0$

$\beta_{ic} \ge 0$

Overview of Algorithm: First grow $\alpha_i$, keeping $\beta_{ic} = 0$, until (2) becomes tight; then grow $\beta_{ic}$ at the same rate until (1) becomes tight
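The dual constraints can be checked on a small instance. The sketch below is a feasibility checker only, not the primal-dual algorithm, and it assumes the dual variables are named alpha (per point) and beta (per point and cluster). The instance (three points, two candidate clusters, and the values of d and f) is hypothetical. By weak LP duality, any feasible dual value is a lower bound on the cost of any feasible clustering.

```python
def dual_feasible(alpha, beta, d, f, tol=1e-9):
    """Check constraints (1) and (2) plus non-negativity of alpha and beta."""
    n, m = len(alpha), len(d)
    if any(a < -tol for a in alpha):
        return False
    if any(beta[i][c] < -tol for i in range(n) for c in range(m)):
        return False
    for c in range(m):
        if sum(beta[i][c] for i in range(n)) > f[c] + tol:  # constraint (1)
            return False
        if any(alpha[i] - beta[i][c] > d[c] + tol for i in range(n)):  # constraint (2)
            return False
    return True

# Hypothetical instance: 3 points, 2 candidate clusters.
d = [2.0, 5.0]  # radius of each candidate cluster
f = [1.0, 3.0]  # facility cost of each candidate cluster

alpha = [3.0, 2.0, 2.0]
beta = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

dual_value = sum(alpha)         # 7.0
primal_value = 3 * d[0] + f[0]  # all three points in cluster 0: 7.0

print(dual_feasible(alpha, beta, d, f))  # True
print(dual_value <= primal_value)        # True (weak duality)
```

Here the two values coincide, so on this toy instance both solutions are optimal. Raising the first entry of alpha to 4.0 without raising beta would violate constraint (2) for cluster 0.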

Future Work

• Improve the approximation ratio for Cellular Clustering

• Improve the running time. Presently r-gather is O(n^2), while cellular clustering is a linear program over n^2 variables.
– Linear or even sub-linear time algorithms

• Weaker guarantees on anonymity, e.g. at least k/2 points per cluster instead of k.

THANK YOU!

QUESTIONS?