Dilys Thomas, PODS 2006

Achieving Anonymity via Clustering

G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, A. Zhu


Talk outline

• k-Anonymity model

• Achieving Anonymity via Clustering

• r-Gather clustering

• Cellular clustering

• Future Work


Medical Records

Identifying: SSN, Name. Sensitive: Disease.

SSN  Name   DOB       Race   Zip code  Disease
614  Sara   03/04/76  Cauc   94305     Flu
615  Joan   07/11/80  Cauc   94307     Cold
629  Kelly  05/09/55  Cauc   94301     Diabetes
710  Mike   11/23/62  Afr-A  94305     Flu
840  Carl   11/23/62  Afr-A  94059     Arthritis
780  Joe    01/07/50  Hisp   94042     Heart problem
619  Rob    04/08/43  Hisp   94042     Arthritis


De-identified Medical Records

Sensitive: Disease.

DOB       Race   Zip code  Disease
03/04/76  Cauc   94305     Flu
07/11/80  Cauc   94307     Cold
05/09/55  Cauc   94301     Diabetes
11/23/62  Afr-A  94305     Flu
11/23/62  Afr-A  94059     Arthritis
01/07/50  Hisp   94042     Heart problem
04/08/43  Hisp   94042     Arthritis


k-Anonymity model

DOB       Race   Zip code  Disease
03/04/76  Cauc   94305     Flu
07/11/80  Cauc   94307     Cold
05/09/55  Cauc   94301     Diabetes
12/30/72  Afr-A  94305     Flu
11/23/62  Afr-A  94059     Arthritis
01/07/50  Hisp   94042     Heart problem
04/08/43  Hisp   94042     Arthritis

Quasi-identifiers (DOB, Race, Zip code): approximate foreign keys that can uniquely identify you!

Sensitive: Disease
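
As a minimal sketch of why quasi-identifiers matter, the snippet below counts how many rows of the table above are unique on (DOB, Race, Zip code). The helper name `unique_rows` and the column indices are assumptions made for illustration.

```python
from collections import Counter

# Rows of the table above: (DOB, race, zip code, disease).
rows = [
    ("03/04/76", "Cauc", "94305", "Flu"),
    ("07/11/80", "Cauc", "94307", "Cold"),
    ("05/09/55", "Cauc", "94301", "Diabetes"),
    ("12/30/72", "Afr-A", "94305", "Flu"),
    ("11/23/62", "Afr-A", "94059", "Arthritis"),
    ("01/07/50", "Hisp", "94042", "Heart problem"),
    ("04/08/43", "Hisp", "94042", "Arthritis"),
]


def unique_rows(rows, qi_columns):
    """Return the rows whose quasi-identifier combination occurs exactly once."""
    def key(row):
        return tuple(row[c] for c in qi_columns)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] == 1]


# Hypothetical choice of quasi-identifier columns: DOB, race, zip code.
exposed = unique_rows(rows, qi_columns=(0, 1, 2))
print(len(exposed), "of", len(rows), "rows are unique on (DOB, race, zip code)")
```

Every row is unique on this combination, so anyone who can join on these attributes against an external source can attach a name to a disease.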


k-Anonymity Model [Swe00]

• Suppress some entries of quasi-identifiers

– Each modified row becomes identical to at least k-1 other rows with respect to quasi-identifiers

• Individual records hidden in a crowd of size k


2-Anonymized Table

DOB       Race   Zip code  Disease
*         Cauc   *         Flu
*         Cauc   *         Cold
*         Cauc   *         Diabetes
11/23/62  Afr-A  *         Flu
11/23/62  Afr-A  *         Arthritis
*         Hisp   94042     Heart problem
*         Hisp   94042     Arthritis
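
The table above can be checked mechanically. The sketch below tests k-anonymity on the quasi-identifier columns and counts the suppressed cells; the function names and column indices are illustrative assumptions, not part of the talk.

```python
from collections import Counter

# The 2-anonymized table above: (DOB, race, zip code, disease).
table = [
    ("*", "Cauc", "*", "Flu"),
    ("*", "Cauc", "*", "Cold"),
    ("*", "Cauc", "*", "Diabetes"),
    ("11/23/62", "Afr-A", "*", "Flu"),
    ("11/23/62", "Afr-A", "*", "Arthritis"),
    ("*", "Hisp", "94042", "Heart problem"),
    ("*", "Hisp", "94042", "Arthritis"),
]
QI_COLUMNS = (0, 1, 2)  # assumed quasi-identifiers: DOB, race, zip code


def is_k_anonymous(table, qi_columns, k):
    """True if every quasi-identifier combination is shared by at least k rows."""
    counts = Counter(tuple(row[c] for c in qi_columns) for row in table)
    return min(counts.values()) >= k


def suppressed_cells(table, qi_columns):
    """Number of quasi-identifier entries replaced by '*'."""
    return sum(row[c] == "*" for row in table for c in qi_columns)


print(is_k_anonymous(table, QI_COLUMNS, k=2))  # 2-anonymity check
print(suppressed_cells(table, QI_COLUMNS))     # cost in suppressed entries
```

The table passes for k = 2 but not for k = 3, because the Afr-A and Hisp groups contain only two rows each.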


k-Anonymity Optimization

• Minimize the number of generalizations/suppressions needed to achieve k-anonymity

• NP-hard to find the minimum number of suppressions/generalizations [MW04]

• O(k)-approximation for k-anonymity [AFK+05]

• Ω(k) lower bound on the approximation ratio under the graph assumption


Original Table

Name    Age  Salary
Amy     25   50
Brian   27   60
Carol   29   100
David   35   110
Evelyn  39   120


2-Anonymity with Suppression

Name    Age  Salary
Amy     *    *
Brian   *    *
Carol   *    *
David   *    *
Evelyn  *    *

All attributes suppressed


2-Anonymity with Generalization

Name    Age    Salary
Amy     20-30  50-100
Brian   20-30  50-100
Carol   20-30  50-100
David   30-40  100-150
Evelyn  30-40  100-150

Generalization allows pre-specified ranges
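
A minimal sketch of generalization into pre-specified ranges, using the ranges shown on this slide. Treating both endpoints as inclusive and letting the first matching range win is an assumption, chosen so that salary 100 lands in 50-100 as in the table; `generalize` is a hypothetical helper.

```python
# Pre-specified ranges from the slide. Endpoints are treated as inclusive and
# the first matching range wins (assumption: this is what puts 100 in 50-100).
AGE_RANGES = [(20, 30), (30, 40)]
SALARY_RANGES = [(50, 100), (100, 150)]


def generalize(value, ranges):
    """Replace a value by the label of the first pre-specified range containing it."""
    for low, high in ranges:
        if low <= value <= high:
            return f"{low}-{high}"
    raise ValueError(f"{value} is outside every pre-specified range")


original = {
    "Amy": (25, 50),
    "Brian": (27, 60),
    "Carol": (29, 100),
    "David": (35, 110),
    "Evelyn": (39, 120),
}

generalized = {
    name: (generalize(age, AGE_RANGES), generalize(salary, SALARY_RANGES))
    for name, (age, salary) in original.items()
}

for name, (age_range, salary_range) in generalized.items():
    print(name, age_range, salary_range)
```

Because the ranges are fixed in advance, they can be much wider than the data in a group actually requires, which is the distortion that clustering aims to reduce.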


2-Anonymity with Clustering

Name    Age      Salary
Amy     [25-29]  [50-100]
Brian   [25-29]  [50-100]
Carol   [25-29]  [50-100]
David   [35-39]  [110-120]
Evelyn  [35-39]  [110-120]

Cluster centers published:

Cluster 1: Age 27 = (25+27+29)/3, Salary 70 = (50+60+100)/3

Cluster 2: Age 37 = (35+39)/2, Salary 115 = (110+120)/2
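
The arithmetic above can be reproduced with a short sketch. The grouping of names into two clusters is taken from the table; the `summarize` helper and its output layout are assumptions made for illustration.

```python
# Original (age, salary) records from the talk's running example.
records = {
    "Amy": (25, 50),
    "Brian": (27, 60),
    "Carol": (29, 100),
    "David": (35, 110),
    "Evelyn": (39, 120),
}

# Cluster membership as shown in the 2-anonymity-with-clustering table.
clusters = [["Amy", "Brian", "Carol"], ["David", "Evelyn"]]


def summarize(cluster):
    """Return the size, per-attribute ranges, and mean center of one cluster."""
    ages = [records[name][0] for name in cluster]
    salaries = [records[name][1] for name in cluster]
    return {
        "size": len(cluster),
        "age_range": (min(ages), max(ages)),
        "salary_range": (min(salaries), max(salaries)),
        "center": (sum(ages) / len(ages), sum(salaries) / len(salaries)),
    }


summaries = [summarize(cluster) for cluster in clusters]
for summary in summaries:
    print(summary)
```

Unlike pre-specified generalization ranges, the ranges here are exactly as wide as the data in each cluster.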


Advantages of Clustering

• Clustering reduces the amount of distortion introduced as compared to suppressions / generalizations

• Clustering allows constant factor approximation algorithms


Quasi-Identifiers form a Metric Space

• Convert quasi-identifiers into points in a metric space

• Distance function, D, on points

– D(X,X) = 0 (reflexive)

– D(X,Y) = D(Y,X) (symmetric)

– D(X,Z) <= D(X,Y) + D(Y,Z) (triangle inequality)
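
A brute-force sketch that checks the three properties listed above on a finite set of points. The function `is_metric` and the sample data are assumptions for illustration; it only tests the given points and proves nothing about the function in general.

```python
from itertools import product


def is_metric(points, dist, tol=1e-9):
    """Check D(X,X)=0, symmetry and the triangle inequality on the given points."""
    for x in points:
        if abs(dist(x, x)) > tol:
            return False
    for x, y in product(points, repeat=2):
        if abs(dist(x, y) - dist(y, x)) > tol:
            return False
    for x, y, z in product(points, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            return False
    return True


ages = [25, 27, 29, 35, 39]


def absolute_difference(a, b):
    return abs(a - b)


def squared_difference(a, b):
    return (a - b) ** 2


print(is_metric(ages, absolute_difference))  # satisfies all three properties
print(is_metric(ages, squared_difference))   # violates the triangle inequality
```

The squared difference fails because D(25,29) = 16 exceeds D(25,27) + D(27,29) = 8.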


Metric Space

• Converting (gender, zip code, DOB) into points in a metric space is not easy.

• Define a distance function on each attribute.

• E.g. on zip code:

– D(Zip1, Zip2) = physical distance between locations Zip1 and Zip2.

• Weight the attributes; the weighted sum of attribute distances gives a metric.
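
The weighted-sum construction can be sketched as follows. The per-attribute distances (a 0/1 distance on a categorical attribute, absolute difference on age) and the weights are hypothetical choices, not values from the talk; the weights must be positive for the result to remain a metric.

```python
def discrete_distance(a, b):
    """0/1 distance for a categorical attribute (assumed choice)."""
    return 0 if a == b else 1


def absolute_difference(a, b):
    """Distance for a numeric attribute (assumed choice)."""
    return abs(a - b)


def weighted_distance(x, y, attribute_distances, weights):
    """Weighted sum of per-attribute distances between records x and y."""
    return sum(
        weight * dist(a, b)
        for a, b, dist, weight in zip(x, y, attribute_distances, weights)
    )


# Hypothetical records of the form (race, age) and hypothetical weights.
ATTRIBUTE_DISTANCES = (discrete_distance, absolute_difference)
WEIGHTS = (10, 1)

x = ("Cauc", 25)
y = ("Hisp", 29)
print(weighted_distance(x, y, ATTRIBUTE_DISTANCES, WEIGHTS))
```

Here the mismatch on race contributes 10 and the age gap contributes 4. Choosing the weights is a modelling decision about how much each attribute should matter.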


Clustering for Anonymity

• Cluster quasi-identifiers so that each cluster has at least r members for anonymity.

• Publish the cluster centers, together with the number of points and the radius of each cluster.

• Tight clusters → usefulness of data for mining

• Large number of points per cluster → anonymity


Quasi-identifiers: Metric Space

• Assume further that the distance metric has already been defined on quasi-identifiers


r-Gather Clustering

10 points, radius 5

20 points, radius 10

50 points, radius 20

• Minimize the maximum radius: 20
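
A small sketch of the r-gather objective on this example. The cluster list combines the clusters shown here with the 50-point, radius-20 cluster that appears in the cellular clustering metric later in the talk; the function name and the choice r = 10 are assumptions.

```python
# Example clusters as (number of points, radius).
clusters = [(10, 5), (20, 10), (50, 20)]


def r_gather_cost(clusters, r):
    """Maximum cluster radius, provided every cluster has at least r points."""
    for size, _radius in clusters:
        if size < r:
            raise ValueError(f"cluster with {size} points violates r = {r}")
    return max(radius for _size, radius in clusters)


print(r_gather_cost(clusters, r=10))  # largest radius among the clusters
```

Only the largest radius counts in this objective, so the two tighter clusters do not affect the cost.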


Results

• 2-approximation for minimizing the maximum radius under the cluster size constraint

• Matching lower bound of 2 for maximum radius minimization


r-Gather Clustering

[Figure: clusters with distances labeled 2d]


Lower Bound: Reduction from 3-SAT

• r-gather with radius 1 iff the formula is satisfiable; otherwise radius ≥ 2

[Figure: reduction gadget with points X1T, X1F, X2T, X2F, two groups of r-2 points, and a point for clause C1 on variables X1, X2]


Cellular Clustering

[Figure: example clusters: 10 points, radius 5; 20 points, radius 10; 50 points, radius 20]


Cellular Clustering Metric

10 points, radius 5

20 points, radius 10

50 points, radius 20

Cellular Clustering Metric: 10*5 + 20*10 + 50*20 = 50 + 200 + 1000 = 1250
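
The metric above is a sum over clusters of (number of points) times (radius). A minimal sketch, with `cellular_cost` as a hypothetical helper name:

```python
# Example clusters as (number of points, radius), as in the metric above.
clusters = [(10, 5), (20, 10), (50, 20)]


def cellular_cost(clusters):
    """Sum over clusters of (number of points) * (cluster radius)."""
    return sum(size * radius for size, radius in clusters)


print(cellular_cost(clusters))
```

Each point is charged the radius of its own cluster, so a single wide cluster does not dominate the cost the way it does when only the maximum radius is measured.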


Cellular Clustering

• Primal-dual 4-approximation algorithm for cellular clustering

• Constant-factor approximation with a minimum cluster size

– Each cluster has at least r points


Cellular Clustering: Linear Program

Minimize $\sum_c \left( \sum_i x_{ic} d_c + f_c y_c \right)$

Sum of cellular cost and facility cost

Subject to:

$\sum_c x_{ic} \ge 1$: each point belongs to a cluster

$x_{ic} \le y_c$: a cluster must be opened for a point to belong to it

$0 \le x_{ic} \le 1$: points belong to clusters positively

$0 \le y_c \le 1$: clusters are opened positively
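
To make the program above concrete, the sketch below evaluates the objective and checks the constraints for a given assignment. It does not solve the linear program. The tiny instance (two points, two candidate clusters) and all names are assumptions for illustration, and which points lie within reach of which candidate cluster is not modelled.

```python
def primal_cost(x, y, d, f):
    """Objective: sum over c of (sum over i of x[i][c] * d[c]) + f[c] * y[c]."""
    return sum(
        sum(row[c] for row in x) * d[c] + f[c] * y[c]
        for c in range(len(y))
    )


def is_primal_feasible(x, y, tol=1e-9):
    """Check the four constraint families of the linear program."""
    for row in x:
        if sum(row) < 1 - tol:                 # each point belongs to a cluster
            return False
        for c, x_ic in enumerate(row):
            if x_ic < -tol or x_ic > 1 + tol:  # 0 <= x_ic <= 1
                return False
            if x_ic > y[c] + tol:              # x_ic <= y_c
                return False
    return all(-tol <= y_c <= 1 + tol for y_c in y)  # 0 <= y_c <= 1


# Hypothetical instance: 2 points, 2 candidate clusters.
d = [3, 5]            # radius d_c of each candidate cluster
f = [4, 2]            # facility cost f_c of each candidate cluster
x = [[1, 0], [1, 0]]  # both points assigned to cluster 0
y = [1, 0]            # only cluster 0 is opened

print(is_primal_feasible(x, y), primal_cost(x, y, d, f))
```

Assigning both points to cluster 1 instead costs 2*5 + 2 = 12, so cluster 0 is the cheaper integral choice for this instance.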


Dual Program

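• Maximize $\sum_i \alpha_i$

• Subject to:

$\sum_i \beta_{ic} \le f_c$ (1)

$\alpha_i - \beta_{ic} \le d_c$ (2)

$\alpha_i \ge 0$

$\beta_{ic} \ge 0$

Overview of algorithm: first grow $\alpha_i$ keeping $\beta_{ic} = 0$ until (2) becomes tight, then grow $\beta_{ic}$ at the same rate until (1) becomes tight.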
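
A sketch that checks dual feasibility for a hand-worked example, writing the dual variables as `alpha[i]` (one per point) and `beta[i][c]` (one per point and cluster); these names, the instance, and the helper functions are assumptions. The values were obtained by following the growth rule above by hand on a hypothetical instance with two points, radii d = [3, 5] and facility costs f = [4, 2]. This is not an implementation of the full primal-dual algorithm.

```python
def dual_value(alpha):
    """Dual objective: sum of alpha over all points."""
    return sum(alpha)


def is_dual_feasible(alpha, beta, d, f, tol=1e-9):
    """Check constraints (1) and (2) and non-negativity."""
    points, clusters = range(len(alpha)), range(len(d))
    for c in clusters:
        if sum(beta[i][c] for i in points) > f[c] + tol:  # constraint (1)
            return False
        for i in points:
            if alpha[i] - beta[i][c] > d[c] + tol:        # constraint (2)
                return False
    non_negative_alpha = all(a >= -tol for a in alpha)
    non_negative_beta = all(b >= -tol for row in beta for b in row)
    return non_negative_alpha and non_negative_beta


d = [3, 5]  # hypothetical radii d_c
f = [4, 2]  # hypothetical facility costs f_c

# Hand-worked growth: alpha rises to 3, where (2) is tight for cluster 0;
# beta[i][0] then rises with alpha until (1) is tight: 2 + 2 = f[0] = 4.
alpha = [5, 5]
beta = [[2, 0], [2, 0]]

primal = 2 * d[0] + f[0]  # both points assigned to cluster 0
print(is_dual_feasible(alpha, beta, d, f), dual_value(alpha), primal)
```

On this instance the dual value equals the cost of opening cluster 0 for both points, so by weak duality that assignment is optimal here.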


Future Work

• Improve approximation ratio for Cellular Clustering

• Improve running time. Presently r-gather is O(n^2), while cellular clustering is a linear program over n^2 variables.

– Linear or even sub-linear time algorithms

• Weaker guarantees on anonymity, e.g. at least k/2 points per cluster instead of k.


THANK YOU!

QUESTIONS?
