Page 1: Achieving Anonymity via Clustering

Dilys Thomas PODS 2006 1

Achieving Anonymity via Clustering

G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, A. Zhu

Page 2: Achieving Anonymity via Clustering

Talk outline

• k-Anonymity model

• Achieving Anonymity via Clustering

• r-Gather clustering

• Cellular clustering

• Future Work

Page 3: Achieving Anonymity via Clustering

Medical Records

SSN  Name   DOB       Race   Zip code  Disease
614  Sara   03/04/76  Cauc   94305     Flu
615  Joan   07/11/80  Cauc   94307     Cold
629  Kelly  05/09/55  Cauc   94301     Diabetes
710  Mike   11/23/62  Afr-A  94305     Flu
840  Carl   11/23/62  Afr-A  94059     Arthritis
780  Joe    01/07/50  Hisp   94042     Heart problem
619  Rob    04/08/43  Hisp   94042     Arthritis

Identifying attributes: SSN, Name. Sensitive attribute: Disease.

Page 4: Achieving Anonymity via Clustering

De-identified Medical Records

DOB       Race   Zip code  Disease
03/04/76  Cauc   94305     Flu
07/11/80  Cauc   94307     Cold
05/09/55  Cauc   94301     Diabetes
11/23/62  Afr-A  94305     Flu
11/23/62  Afr-A  94059     Arthritis
01/07/50  Hisp   94042     Heart problem
04/08/43  Hisp   94042     Arthritis

Sensitive attribute: Disease.

Page 5: Achieving Anonymity via Clustering

k-Anonymity model

DOB       Race   Zip code  Disease
03/04/76  Cauc   94305     Flu
07/11/80  Cauc   94307     Cold
05/09/55  Cauc   94301     Diabetes
12/30/72  Afr-A  94305     Flu
11/23/62  Afr-A  94059     Arthritis
01/07/50  Hisp   94042     Heart problem
04/08/43  Hisp   94042     Arthritis

Quasi-identifiers (DOB, Race, Zip code): approximate foreign keys that can uniquely identify you!

Sensitive attribute: Disease.

Page 6: Achieving Anonymity via Clustering

k-Anonymity Model [Swe00]

• Suppress some entries of quasi-identifiers

– each modified row becomes identical to at least k-1 other rows with respect to quasi-identifiers

• Individual records hidden in a crowd of size k
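
The definition above can be checked mechanically. The following is a minimal sketch, not code from the talk: the function name `is_k_anonymous`, the dictionary row layout, and the sample rows are assumptions made for illustration. It groups rows by their quasi-identifier values and verifies that every group has at least k members.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in `rows` occurs in at least k rows."""
    counts = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(n >= k for n in counts.values())

# Illustrative rows only; "*" marks a suppressed entry.
rows = [
    {"dob": "*", "race": "Cauc", "zip": "*", "disease": "Flu"},
    {"dob": "*", "race": "Cauc", "zip": "*", "disease": "Cold"},
    {"dob": "11/23/62", "race": "Afr-A", "zip": "*", "disease": "Flu"},
    {"dob": "11/23/62", "race": "Afr-A", "zip": "*", "disease": "Arthritis"},
]
QI = ["dob", "race", "zip"]

print(is_k_anonymous(rows, QI, 2))  # True: each group has two rows
print(is_k_anonymous(rows, QI, 3))  # False: no group has three rows
```

The sensitive column is deliberately left out of the grouping key, since only the quasi-identifiers have to be indistinguishable.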

Page 7: Achieving Anonymity via Clustering

2-Anonymized Table

DOB       Race   Zip code  Disease
*         Cauc   *         Flu
*         Cauc   *         Cold
*         Cauc   *         Diabetes
11/23/62  Afr-A  *         Flu
11/23/62  Afr-A  *         Arthritis
*         Hisp   94042     Heart problem
*         Hisp   94042     Arthritis

Page 8: Achieving Anonymity via Clustering

k-Anonymity Optimization

• Minimize the number of generalizations/suppressions needed to achieve k-anonymity

• Finding the minimum number of suppressions/generalizations is NP-hard [MW04]

• O(k)-approximation for k-anonymity [AFK+05]

• Ω(k) lower bound on the approximation ratio under the graph assumption

Page 9: Achieving Anonymity via Clustering

Talk outline

• k-Anonymity model

• Achieving Anonymity via Clustering

• r-Gather clustering

• Cellular Clustering

• Future Work

Page 10: Achieving Anonymity via Clustering

Original Table

Name    Age  Salary
Amy     25   50
Brian   27   60
Carol   29   100
David   35   110
Evelyn  39   120

Page 11: Achieving Anonymity via Clustering

2-Anonymity with Suppression

Name    Age  Salary
Amy     *    *
Brian   *    *
Carol   *    *
David   *    *
Evelyn  *    *

All attributes suppressed

Page 12: Achieving Anonymity via Clustering

Original Table

Name    Age  Salary
Amy     25   50
Brian   27   60
Carol   29   100
David   35   110
Evelyn  39   120

Page 13: Achieving Anonymity via Clustering

2-Anonymity with Generalization

Name    Age    Salary
Amy     20-30  50-100
Brian   20-30  50-100
Carol   20-30  50-100
David   30-40  100-150
Evelyn  30-40  100-150

Generalization allows pre-specified ranges

Page 14: Achieving Anonymity via Clustering

Original Table

Name    Age  Salary
Amy     25   50
Brian   27   60
Carol   29   100
David   35   110
Evelyn  39   120

Page 15: Achieving Anonymity via Clustering

2-Anonymity with Clustering

Name    Age      Salary
Amy     [25-29]  [50-100]
Brian   [25-29]  [50-100]
Carol   [25-29]  [50-100]
David   [35-39]  [110-120]
Evelyn  [35-39]  [110-120]

Cluster centers published

Cluster {Amy, Brian, Carol}: Age 27 = (25+27+29)/3, Salary 70 = (50+60+100)/3

Cluster {David, Evelyn}: Age 37 = (35+39)/2, Salary 115 = (110+120)/2
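
The center arithmetic on this slide can be reproduced with a few lines of Python. This is a small sketch for illustration; the helper names `cluster_center` and `cluster_ranges` are assumptions, not names from the talk, and the center is taken to be the coordinate-wise mean, as in the slide's formulas.

```python
def cluster_center(points):
    """Coordinate-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def cluster_ranges(points):
    """Per-attribute (min, max) range covered by the cluster."""
    dims = range(len(points[0]))
    return tuple((min(p[j] for p in points), max(p[j] for p in points)) for j in dims)

# (age, salary) pairs from the original table, grouped as on the slide
cluster_a = [(25, 50), (27, 60), (29, 100)]   # Amy, Brian, Carol
cluster_b = [(35, 110), (39, 120)]            # David, Evelyn

print(cluster_center(cluster_a), cluster_ranges(cluster_a))
# (27.0, 70.0) ((25, 29), (50, 100))
print(cluster_center(cluster_b), cluster_ranges(cluster_b))
# (37.0, 115.0) ((35, 39), (110, 120))
```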

Page 16: Achieving Anonymity via Clustering

Advantages of Clustering

• Clustering reduces the amount of distortion introduced as compared to suppressions / generalizations

• Clustering allows constant factor approximation algorithms

Page 17: Achieving Anonymity via Clustering

Quasi-Identifiers form a Metric Space

• Convert quasi-identifiers into points in a metric space

• Distance function, D, on points

– D(X,X) = 0 (Reflexive)

– D(X,Y) = D(Y,X) (Symmetric)

– D(X,Z) <= D(X,Y) + D(Y,Z) (Triangle Inequality)

Page 18: Achieving Anonymity via Clustering

Metric Space

• Converting (gender, zip code, DOB) into points in a metric space is not easy.

• Define a distance function on each attribute.

• E.g. on Zip code:

– D(Zip1, Zip2) = physical distance between locations Zip1 and Zip2.

• Weight the attributes; the weighted sum of the attribute distances gives a metric.
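
A weighted sum of per-attribute distances could be sketched as follows. Everything concrete here is assumed for illustration: the attribute choice (age and zip code), the weights, and the planar coordinates in `ZIP_COORDS`, which stand in for real zip-code locations. The sum is a metric provided each attribute distance is a metric and the weights are positive.

```python
def age_distance(a, b):
    """Distance on a numeric attribute: absolute difference."""
    return abs(a - b)

def zip_distance(z1, z2, coords):
    """Euclidean distance between hypothetical (x, y) locations of two zip codes."""
    (x1, y1), (x2, y2) = coords[z1], coords[z2]
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

def make_weighted_metric(attribute_distances, weights):
    """Combine per-attribute distances into D(X, Y) = sum_j w_j * d_j(X[j], Y[j])."""
    def distance(x, y):
        return sum(w * d(xj, yj)
                   for w, d, xj, yj in zip(weights, attribute_distances, x, y))
    return distance

# Made-up planar coordinates standing in for physical zip-code locations.
ZIP_COORDS = {"94305": (0.0, 0.0), "94301": (3.0, 4.0), "94042": (6.0, 8.0)}

D = make_weighted_metric(
    [age_distance, lambda a, b: zip_distance(a, b, ZIP_COORDS)],
    weights=[1.0, 2.0],
)

x, y, z = (30, "94305"), (33, "94301"), (40, "94042")
print(D(x, y))                       # 1*3 + 2*5 = 13.0
print(D(x, z) <= D(x, y) + D(y, z))  # triangle inequality holds: True
```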

Page 19: Achieving Anonymity via Clustering

Clustering for Anonymity

• Cluster quasi-identifiers so that each cluster has at least r members, for anonymity.

• Publish the cluster centers, together with the number of points and the radius of each cluster

• Tight clusters ⇒ usefulness of data for mining

• Large number of points per cluster ⇒ anonymity
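
What gets published per cluster (center, number of points, radius) could look like the sketch below. The rule for picking the center is an assumption made here for illustration, namely the member point whose farthest fellow member is closest; the function names and the Manhattan distance are likewise hypothetical stand-ins for whatever metric is defined on the quasi-identifiers.

```python
def summarize_cluster(points, distance):
    """Return (center, size, radius) for one cluster.

    Assumption: the center is the member point whose farthest
    fellow member is closest; the radius is that farthest distance.
    """
    def radius_around(c):
        return max(distance(c, p) for p in points)
    center = min(points, key=radius_around)
    return center, len(points), radius_around(center)

def publish(clusters, distance, r):
    """Summaries for all clusters, refusing any cluster smaller than r."""
    for cluster in clusters:
        if len(cluster) < r:
            raise ValueError("cluster smaller than r breaks the anonymity guarantee")
    return [summarize_cluster(c, distance) for c in clusters]

def manhattan(p, q):
    """Stand-in metric: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

clusters = [[(25, 50), (27, 60), (29, 100)], [(35, 110), (39, 120)]]
print(publish(clusters, manhattan, r=2))
# [((27, 60), 3, 42), ((35, 110), 2, 14)]
```

Only the summaries leave the function; the individual points inside each cluster are never returned.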

Page 20: Achieving Anonymity via Clustering

Quasi-identifiers: Metric Space

• Assume from now on that a distance metric has already been defined on the quasi-identifiers

Page 21: Achieving Anonymity via Clustering

Talk outline

• k-Anonymity model

• Achieving Anonymity via Clustering

• r-Gather clustering

• Cellular Clustering

• Future Work

Page 22: Achieving Anonymity via Clustering

r-Gather Clustering

10 points, radius 5

20 points, radius 10

50 points, radius 20

• Objective: minimize the maximum radius (here 20)
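
For the three example clusters, the r-gather objective is simply the largest radius. A minimal sketch, assuming clusters are given as hypothetical `(num_points, radius)` pairs and that r is the minimum allowed cluster size:

```python
def r_gather_cost(clusters, r):
    """Maximum radius over clusters given as (num_points, radius) pairs.

    Raises ValueError if any cluster has fewer than r points,
    since such a clustering is not a valid r-gather solution.
    """
    if any(n < r for n, _ in clusters):
        raise ValueError("every cluster needs at least r points")
    return max(radius for _, radius in clusters)

example = [(10, 5), (20, 10), (50, 20)]
print(r_gather_cost(example, r=10))  # 20
```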

Page 23: Achieving Anonymity via Clustering

Results

• 2-approximation for minimizing the maximum radius under the cluster size constraint

• Matching lower bound of 2 for maximum radius minimization

Page 24: Achieving Anonymity via Clustering

r-Gather Clustering

[Figure: clusters, each labeled with radius 2d]

Page 25: Achieving Anonymity via Clustering

Lower Bound: Reduction from 3-SAT

[Figure: variable points X1T, X1F, X2T, X2F, with two groups of r-2 additional points]

• r-gather with radius 1 iff formula satisfiable; else radius ≥ 2

C1 = X1 ∨ X2 (clause point C1 in the figure)

Page 26: Achieving Anonymity via Clustering

Talk outline

• k-Anonymity model

• Achieving Anonymity via Clustering

• r-Gather clustering

• Cellular Clustering

• Future Work

Page 27: Achieving Anonymity via Clustering

Cellular Clustering

10 points, radius 5

20 points, radius 10

50 points, radius 20

Page 28: Achieving Anonymity via Clustering

Cellular Clustering Metric

10 points, radius 5

20 points, radius 10

50 points, radius 20

Cellular Clustering Metric: 10*5 + 20*10 + 50*20 = 50 + 200 + 1000 = 1250
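
The same sum in code, as a small sketch: each cluster contributes its number of points times its radius. The function name `cellular_cost` and the `(num_points, radius)` pair layout are assumptions for illustration.

```python
def cellular_cost(clusters):
    """Sum over clusters of (number of points) * (radius)."""
    return sum(n * radius for n, radius in clusters)

example = [(10, 5), (20, 10), (50, 20)]
print(cellular_cost(example))  # 1250
```

Unlike the maximum-radius objective, this cost charges every point for the radius of its own cluster, so a single large cluster no longer hides the quality of the smaller ones.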

Page 29: Achieving Anonymity via Clustering

Cellular Clustering

• Primal-dual 4-approximation algorithm for cellular clustering

• Constant-factor approximation with a minimum cluster size

– Each cluster has at least r points

Page 30: Achieving Anonymity via Clustering

Cellular Clustering: Linear Program

Minimize $\sum_c \left( \sum_i x_{ic} d_c + f_c y_c \right)$

Sum of cellular cost and facility cost

Subject to:

$\sum_c x_{ic} \ge 1$   Each point belongs to a cluster

$x_{ic} \le y_c$   A cluster must be opened for a point to belong to it

$0 \le x_{ic} \le 1$   Points belong to clusters positively

$0 \le y_c \le 1$   Clusters are opened positively
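
To make the program above concrete, the sketch below evaluates its objective and checks its constraints for a given candidate solution. It does not solve the LP. The data layout (`x[i][c]` as nested lists, `y`, `d`, `f` as lists indexed by cluster) and all numeric values are hypothetical.

```python
def lp_objective(x, y, d, f):
    """sum_c ( sum_i x[i][c] * d[c] + f[c] * y[c] )"""
    clusters = range(len(y))
    return sum(sum(x_i[c] for x_i in x) * d[c] + f[c] * y[c] for c in clusters)

def is_feasible(x, y, tol=1e-9):
    """Check sum_c x_ic >= 1, 0 <= x_ic <= y_c and 0 <= y_c <= 1."""
    clusters = range(len(y))
    for x_i in x:
        if sum(x_i) < 1 - tol:
            return False  # point i is not fully assigned
        if any(not (-tol <= x_i[c] <= y[c] + tol) for c in clusters):
            return False  # assigned to a cluster that is not open enough
    return all(-tol <= y_c <= 1 + tol for y_c in y)

# Hypothetical instance: 3 points, 2 candidate clusters.
d = [5, 10]   # radius (distance cost) of each candidate cluster
f = [2, 3]    # facility cost of opening each cluster
x = [[1, 0], [1, 0], [0, 1]]   # integral assignment of points to clusters
y = [1, 1]                     # both clusters opened

print(is_feasible(x, y))         # True
print(lp_objective(x, y, d, f))  # (2*5 + 2) + (1*10 + 3) = 25
```

Since x_ic is bounded by y_c and y_c by 1, the upper bound on x_ic does not need a separate check.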

Page 31: Achieving Anonymity via Clustering

Dual Program

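• Maximize $\sum_i \alpha_i$

• Subject to:

$\sum_i \beta_{ic} \le f_c$   (1)

$\alpha_i - \beta_{ic} \le d_c$   (2)

$\alpha_i \ge 0$

$\beta_{ic} \ge 0$

Overview of algorithm: first grow $\alpha_i$ keeping $\beta_{ic} = 0$ until (2) becomes tight, then grow $\beta_{ic}$ at the same rate until (1) becomes tight.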

Page 32: Achieving Anonymity via Clustering

Future Work

• Improve the approximation ratio for cellular clustering

• Improve the running time. At present r-gather is O(n^2), while cellular clustering is a linear program over n^2 variables.

– Linear or even sub-linear time algorithms

• Weaker guarantees on anonymity, e.g. at least k/2 points per cluster instead of k.

Page 33: Achieving Anonymity via Clustering

THANK YOU!

QUESTIONS?