Anonymization Algorithms - Microaggregation and Clustering Li Xiong CS573 Data Privacy and Anonymity

Uploaded by shina on 25-Feb-2016

TRANSCRIPT

Page 1: Anonymization Algorithms -  Microaggregation and Clustering

Anonymization Algorithms - Microaggregation and Clustering

Li Xiong

CS573 Data Privacy and Anonymity

Page 2: Anonymization Algorithms -  Microaggregation and Clustering

Anonymization using Microaggregation or Clustering

Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002

Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005

Achieving anonymity via clustering, Aggarwal, PODS 2006

Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

Page 3: Anonymization Algorithms -  Microaggregation and Clustering

Anonymization Methods

Perturbative: distort the data. Statistics computed on the perturbed dataset should not differ significantly from those on the original. Examples: microaggregation, additive noise.

Non-perturbative: don't distort the data. Generalization: combine several categories to form a new, less specific category. Suppression: remove the values of a few attributes in some records, or remove entire records.

Page 4: Anonymization Algorithms -  Microaggregation and Clustering

Types of data

Continuous: the attribute is numeric and arithmetic operations can be performed on it.

Categorical: the attribute takes values over a finite set and standard arithmetic operations don't make sense.

Ordinal: an ordered range of categories; ≤, min and max operations are meaningful.

Nominal: unordered; only the equality comparison operation is meaningful.

Page 5: Anonymization Algorithms -  Microaggregation and Clustering

Measure tradeoffs

k-Anonymity: a dataset satisfies k-anonymity for k > 1 if at least k records exist for each combination of quasi-identifier values.

Assuming k-anonymity is enough protection against disclosure risk, one can concentrate on information-loss measures.
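The k-anonymity condition can be checked mechanically: group records by their quasi-identifier values and verify every group has at least k members. A minimal sketch (the attribute names and records are illustrative, not from the deck):

```python
from collections import Counter

def satisfies_k_anonymity(records, quasi_identifiers, k):
    """True iff every combination of quasi-identifier values
    appears in at least k records."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

rows = [
    {"age": "20-30", "zip": "0820*"},
    {"age": "20-30", "zip": "0820*"},
    {"age": "30-40", "zip": "0820*"},  # quasi-identifier group of size 1
]
print(satisfies_k_anonymity(rows, ["age", "zip"], 2))  # False
```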

Page 6: Anonymization Algorithms -  Microaggregation and Clustering

Critique of Generalization/Suppression

Satisfying k-anonymity using generalization and suppression is NP-hard.

Computational cost of finding the optimal generalization.

How to determine the subset of appropriate generalizations: it depends on the semantics of the categories and the intended use of the data. E.g., for ZIP codes, {08201, 08205} -> 0820* makes sense, but {08201, 05201} -> 0*201 doesn't.

Page 7: Anonymization Algorithms -  Microaggregation and Clustering

Problems cont.

How to apply a generalization:

Globally: may generalize records that don't need it.

Locally: difficult to automate and analyze; the number of possible generalizations is even larger.

Generalization and suppression are unsuitable for continuous data: a numeric attribute becomes categorical and loses its numeric semantics.

Page 8: Anonymization Algorithms -  Microaggregation and Clustering

Problems cont.

How to optimally combine generalization and suppression is unknown.

Use of suppression is not homogeneous: suppress entire records or only some attributes of some records? Blank out a suppressed value or replace it with a neutral value?

Page 9: Anonymization Algorithms -  Microaggregation and Clustering

Microaggregation/Clustering

Two steps:

1. Partition the original dataset into clusters of similar records, each containing at least k records.

2. For each cluster, compute an aggregation operation and use it to replace the original records, e.g., the mean for continuous data, the median for categorical data.
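As a concrete illustration of the two steps, here is a minimal univariate sketch that partitions sorted values into consecutive groups of at least k and replaces each value by its group mean. (This naive partitioning is only for illustration; the algorithms in this deck, such as MDAV, choose the clusters far more carefully.)

```python
def microaggregate(values, k):
    """Univariate microaggregation sketch: sort the values, cut them
    into consecutive clusters of at least k records, and replace
    every value by its cluster mean."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        # the last group absorbs the remainder so every cluster has >= k records
        j = len(order) if len(order) - i < 2 * k else i + k
        group = order[i:j]
        mean = sum(values[g] for g in group) / len(group)
        for g in group:
            out[g] = mean
        i = j
    return out

print(microaggregate([50, 60, 100, 110, 120], 2))  # [55.0, 55.0, 110.0, 110.0, 110.0]
```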

Page 10: Anonymization Algorithms -  Microaggregation and Clustering

Advantages:

A unified approach, unlike the combination of generalization and suppression.

Near-optimal heuristics exist.

Doesn't generate new categories.

Suitable for continuous data without removing their numeric semantics.

Page 11: Anonymization Algorithms -  Microaggregation and Clustering

Advantages cont.

Reduces data distortion: k-anonymity requires an attribute to be generalized or suppressed even if all but one tuple in the set have the same value.

Clustering allows a cluster center to be published instead, "enabling us to release more information."

Page 12: Anonymization Algorithms -  Microaggregation and Clustering

Original Table

Name    Age  Salary
Amy     25   50
Brian   27   60
Carol   29   100
David   35   110
Evelyn  39   120

Page 13: Anonymization Algorithms -  Microaggregation and Clustering

2-Anonymity with Generalization

Name    Age    Salary
Amy     20-30  50-100
Brian   20-30  50-100
Carol   20-30  50-100
David   30-40  100-150
Evelyn  30-40  100-150

Generalization allows pre-specified ranges.

Page 14: Anonymization Algorithms -  Microaggregation and Clustering

2-Anonymity with Clustering

Name    Age      Salary
Amy     [25-29]  [50-100]
Brian   [25-29]  [50-100]
Carol   [25-29]  [50-100]
David   [35-39]  [110-120]
Evelyn  [35-39]  [110-120]

Cluster centers ([27, 70] and [37, 115]) are published:

27 = (25+27+29)/3, 70 = (50+60+100)/3
37 = (35+39)/2, 115 = (110+120)/2

Page 15: Anonymization Algorithms -  Microaggregation and Clustering

Another example: the records share no common value in any attribute.

Page 16: Anonymization Algorithms -  Microaggregation and Clustering

Generalization vs. clustering

The generalized version of the table would need to suppress all attributes.

The clustered version of the table would publish the cluster center as (1, 1, 1, 1) and the radius as 1.

Page 17: Anonymization Algorithms -  Microaggregation and Clustering

Anonymization using Microaggregation or Clustering

Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002

Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005

Achieving anonymity via clustering, Aggarwal, PODS 2006

Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

Page 18: Anonymization Algorithms -  Microaggregation and Clustering

Multivariate microaggregation algorithm

MDAV-generic: a generic version of the MDAV algorithm (Maximum Distance to Average Vector) from previous papers.

Works with any type of data (continuous, ordinal, nominal), aggregation operator and distance calculation.

Page 19: Anonymization Algorithms -  Microaggregation and Clustering

MDAV-generic(R: dataset, k: integer)
while |R| ≥ 3k
  1. compute the average record ~x of all records in R
  2. find the most distant record xr from ~x
  3. find the most distant record xs from xr
  4. form two clusters: one from xr and the k-1 records closest to xr, another from xs and the k-1 records closest to xs
  5. remove the two clusters from R and run MDAV-generic on the remaining dataset
end while
if 2k ≤ |R| ≤ 3k-1
  1. compute the average record ~x of the remaining records in R
  2. find the most distant record xr from ~x
  3. form a cluster from xr and the k-1 records closest to xr
  4. form another cluster containing the remaining records
else (fewer than 2k records in R)
  form a new cluster from the remaining records
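The loop above can be sketched in Python for continuous data, using Euclidean distance and the arithmetic mean as the average record. This is a minimal sketch: standardization and the final aggregation step are omitted, and it returns clusters of record indices.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def mdav(data, k):
    """MDAV-generic sketch for continuous data: Euclidean distance,
    arithmetic mean as the average record."""
    remaining = set(range(len(data)))
    clusters = []

    def centroid(pool):
        pts = [data[i] for i in pool]
        return [sum(col) / len(pts) for col in zip(*pts)]

    def farthest(pool, point):
        return max(pool, key=lambda i: dist(data[i], point))

    def cluster_around(seed, pool):
        # the seed record plus its k-1 closest records in the pool
        others = sorted(pool - {seed}, key=lambda i: dist(data[i], data[seed]))
        return {seed, *others[:k - 1]}

    while len(remaining) >= 3 * k:
        xr = farthest(remaining, centroid(remaining))
        xs = farthest(remaining, data[xr])
        for seed in (xr, xs):
            if seed in remaining:        # xs could fall inside xr's cluster
                c = cluster_around(seed, remaining)
                clusters.append(sorted(c))
                remaining -= c
    if len(remaining) >= 2 * k:          # 2k <= |R| <= 3k-1: one more cluster
        c = cluster_around(farthest(remaining, centroid(remaining)), remaining)
        clusters.append(sorted(c))
        remaining -= c
    if remaining:                        # leftover records form the last cluster
        clusters.append(sorted(remaining))
    return clusters

print(mdav([(25, 50), (27, 60), (29, 100), (35, 110), (39, 120)], 2))
```

Note that without standardization (next slide) the salary attribute dominates the distances, so the clusters need not match the earlier hand-worked example.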

Page 20: Anonymization Algorithms -  Microaggregation and Clustering

MDAV-generic for continuous attributes

Use the arithmetic mean and Euclidean distance.

Standardize the attributes (subtract the mean and divide by the standard deviation) to give them equal weight when computing distances.

After MDAV-generic, destandardize the attributes. Here xij is the value of the k-anonymized jth attribute for the ith record;

m1(j) and m2(j) are the mean and variance of the k-anonymized jth attribute;

μ1(j) and μ2(j) are the mean and variance of the original jth attribute.
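The standardize/destandardize round trip can be sketched with plain Python (the population standard deviation is assumed here; the illustrative values are the ages from the earlier example):

```python
from statistics import mean, pstdev

ages = [25.0, 27.0, 29.0, 35.0, 39.0]
mu, sigma = mean(ages), pstdev(ages)

standardized = [(a - mu) / sigma for a in ages]    # equal weight in distances
restored = [z * sigma + mu for z in standardized]  # destandardize after MDAV-generic

print([round(a, 6) for a in restored])  # [25.0, 27.0, 29.0, 35.0, 39.0]
```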

Page 21: Anonymization Algorithms -  Microaggregation and Clustering

MDAV-generic for ordinal attributes

The distance between two categories a and b of an attribute Vi is

dord(a, b) = |{i | a ≤ i < b}| / |D(Vi)|

i.e., the number of categories separating a and b divided by the number of categories in the attribute.

Nominal attributes

The distance between two values is defined by equality: 0 if they're equal, 1 otherwise.
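Both distances are easy to state in code; a small sketch (the category list is illustrative, not from the deck):

```python
def ordinal_distance(a, b, categories):
    """d_ord(a, b) = |{i : a <= i < b}| / |D(Vi)|: the number of
    categories separating a and b, over the total number of categories."""
    ranks = {c: i for i, c in enumerate(categories)}
    return abs(ranks[a] - ranks[b]) / len(categories)

def nominal_distance(a, b):
    """Nominal values: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1

levels = ["none", "primary", "secondary", "tertiary"]  # illustrative ordinal domain
print(ordinal_distance("primary", "tertiary", levels))  # 2/4 = 0.5
print(nominal_distance("red", "blue"))                  # 1
```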

Page 22: Anonymization Algorithms -  Microaggregation and Clustering

Empirical Results

Continuous attributes: from the U.S. Current Population Survey (1995); 1080 records described by 13 continuous attributes. Computed k-anonymity for k = 3, ..., 9 and quasi-identifiers with 6 and 13 attributes.

Categorical attributes: from the U.S. Housing Survey (1993); three ordinal and eight nominal attributes. Computed k-anonymity for k = 2, ..., 9 and quasi-identifiers with 3, 4, 8 and 11 attributes.

Page 23: Anonymization Algorithms -  Microaggregation and Clustering

IL measures for continuous attributes

IL1 = mean variation of individual attributes in the original and k-anonymous datasets
IL2 = mean variation of attribute means in both datasets
IL3 = mean variation of attribute variances
IL4 = mean variation of attribute covariances
IL5 = mean variation of attribute Pearson's correlations
IL6 = 100 times the average of IL1-IL5

Page 24: Anonymization Algorithms -  Microaggregation and Clustering

MDAV-generic preserves means and variances.

The impact on the non-preserved statistics grows with the quasi-identifier length, as one would expect.

For a fixed quasi-identifier length, the impact on the non-preserved statistics grows with k.

Page 25: Anonymization Algorithms -  Microaggregation and Clustering

IL measures for categorical attributes

Dist: direct comparison of original and protected values using a categorical distance.

CTBIL': mean variation of frequencies in contingency tables for original and protected data (based on another paper by Domingo-Ferrer and Torra).

ACTBIL': CTBIL' divided by the total number of cells in all considered tables.

EBIL: entropy-based information loss (based on another paper by Domingo-Ferrer and Torra).

Page 26: Anonymization Algorithms -  Microaggregation and Clustering

Ordinal attribute protection using median

Page 27: Anonymization Algorithms -  Microaggregation and Clustering

Ordinal attribute protection using convex median

Page 28: Anonymization Algorithms -  Microaggregation and Clustering

Anonymization using Microaggregation or Clustering

Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002

Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005

Achieving anonymity via clustering, Aggarwal, PODS 2006

Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

Page 29: Anonymization Algorithms -  Microaggregation and Clustering

r-Clustering

Attributes from a table are first redefined as points in a metric space. These points are clustered, and then the cluster centers are published rather than the original quasi-identifiers.

r is the lower bound on the number of members in each cluster.

r is used instead of k to denote the minimum degree of anonymity because k is typically used in clustering to denote the number of clusters.

Page 30: Anonymization Algorithms -  Microaggregation and Clustering

Data published for clusters

Three features are published for the clustered data:

the quasi-identifying attributes of the cluster center,

the number of points within the cluster, and

the set of sensitive values for the cluster (which remain unchanged, as with k-anonymity).

A measure of the quality of the clusters is also published.

Page 31: Anonymization Algorithms -  Microaggregation and Clustering

Defining the records in metric space

Some attributes, such as age and height, are easily mapped to a metric space.

Others, such as ZIP code, may first need to be converted, for example to longitude and latitude.

Some attributes may need to be scaled, such as location, which may differ by thousands of miles.

Some attributes, such as race or nationality, may not convert to points in a metric space easily.

Page 32: Anonymization Algorithms -  Microaggregation and Clustering

How to measure the quality of the clustering: how much it distorts the original data.

Maximum radius (r-GATHER problem): the maximum radius over all clusters.

Cellular cost (r-CELLULAR CLUSTERING problem): each cluster incurs a "facility cost" to set up the cluster center, plus a "service cost" equal to the radius times the number of points in the cluster; the cellular cost is the sum of the facility and service costs over all clusters.

Page 33: Anonymization Algorithms -  Microaggregation and Clustering

[Figure: points arranged in three clusters of 25, 14 and 17 points, with radii 10, 8 and 7 respectively]

Page 34: Anonymization Algorithms -  Microaggregation and Clustering

Cluster quality measurements

Maximum radius = 10

Facility cost plus service cost:
Facility cost = f(c)
Service cost = (17 × 7) + (14 × 8) + (25 × 10) = 481
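The arithmetic above, as a sketch (a uniform facility cost per cluster is assumed here for simplicity; the slide leaves it abstract as f(c)):

```python
def cellular_cost(clusters, facility_cost=0):
    """Cellular cost: a per-cluster facility cost plus a service cost
    of (radius x number of points), summed over all clusters."""
    service = sum(n_points * radius for n_points, radius in clusters)
    return len(clusters) * facility_cost + service

# (number of points, radius) for the three clusters in the figure
print(cellular_cost([(17, 7), (14, 8), (25, 10)]))  # 481
```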

Page 35: Anonymization Algorithms -  Microaggregation and Clustering

r-GATHER problem

“The r-Gather problem is to cluster n points in a metric space into a set of clusters, such that each cluster has at least r points. The objective is to minimize the maximum radius among the clusters.”

Page 36: Anonymization Algorithms -  Microaggregation and Clustering

“Outlier” points

r-GATHER and r-CELLULAR CLUSTERING, like k-anonymity, are sensitive to outlier points (i.e., points far removed from the rest of the data).

The clustering solutions in this paper are generalized to allow an e fraction of outliers to be removed from the data; that is, an e fraction of the tuples can be suppressed.

Page 37: Anonymization Algorithms -  Microaggregation and Clustering

(r, e)-GATHER Clustering

The (r, e)-GATHER clustering formulation of the problem allows an e fraction of the outlier points to remain unclustered (i.e., those tuples are suppressed).

The paper shows there is a polynomial-time algorithm that provides a 4-approximation for the (r, e)-GATHER problem.

Page 38: Anonymization Algorithms -  Microaggregation and Clustering

r-CELLULAR CLUSTERING defined

The r-CELLULAR CLUSTERING problem is to arrange n points into clusters such that each cluster has at least r points, with the minimum total cellular cost.

Page 39: Anonymization Algorithms -  Microaggregation and Clustering

(r, e)-CELLULAR CLUSTERING

There is also an (r, e)-CELLULAR CLUSTERING problem in which an e fraction of the points can be excluded.

The details of the constant-factor approximation for this problem are deferred to the full version of the paper.

Page 40: Anonymization Algorithms -  Microaggregation and Clustering

Anonymization using Microaggregation or Clustering

Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002

Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005

Achieving anonymity via clustering, Aggarwal, PODS 2006

Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

Page 41: Anonymization Algorithms -  Microaggregation and Clustering

Anonymization and Clustering

k-Member Clustering Problem: from a given set of n records, find a set of clusters such that each cluster contains at least k records and the total intra-cluster distance is minimized.

The problem is NP-complete.

Page 42: Anonymization Algorithms -  Microaggregation and Clustering

Distance Metrics

Distance metric for records: measures the dissimilarity between two data points as the sum of the dissimilarities between corresponding attributes, for both numerical and categorical values.

Page 43: Anonymization Algorithms -  Microaggregation and Clustering

Distance between two numerical values

Definition: Let D be a finite numeric domain. The normalized distance between two values vi, vj ∈ D is defined as

δN(vi, vj) = |vi - vj| / |D|

where |D| is the domain size, measured as the difference between the maximum and minimum values in D.

     Age  Country  Occupation    Salary  Diagnosis
r1   41   USA      Armed-Forces  ≥50K    Cancer
r2   57   India    Tech-support  <50K    Flu
r3   40   Canada   Teacher       <50K    Obesity
r4   38   Iran     Tech-support  ≥50K    Flu
r5   24   Brazil   Doctor        ≥50K    Cancer
r6   45   Greece   Salesman      <50K    Fever

Example 1

The distance between r1 and r2 with respect to the Age attribute is |57-41|/|57-24| = 16/33 ≈ 0.4848.

Example 2

The distance between r5 and r6 with respect to the Age attribute is |24-45|/|57-24| = 21/33 ≈ 0.6364.
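Both examples can be reproduced directly; the Age values below come from the table above:

```python
def numeric_distance(v1, v2, domain):
    """delta_N(v1, v2) = |v1 - v2| / |D|, with |D| = max(D) - min(D)."""
    return abs(v1 - v2) / (max(domain) - min(domain))

ages = [41, 57, 40, 38, 24, 45]
print(round(numeric_distance(41, 57, ages), 4))  # 16/33 ≈ 0.4848
print(round(numeric_distance(24, 45, ages), 4))  # 21/33 ≈ 0.6364
```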

Page 44: Anonymization Algorithms -  Microaggregation and Clustering

Distance between two categorical values

In the simplest view, all distinct values are equally different from each other: the distance is 0 if they are the same and 1 if they are different.

Richer relationships can be captured in a taxonomy tree.

Taxonomy tree of Country

Taxonomy tree of Occupation

Page 45: Anonymization Algorithms -  Microaggregation and Clustering

Distance between two categorical values

Definition: Let D be a categorical domain and TD a taxonomy tree defined for D. The normalized distance between two values vi, vj ∈ D is defined as

δC(vi, vj) = H(Λ(vi, vj)) / H(TD)

where Λ(x, y) is the subtree rooted at the lowest common ancestor of x and y, and H(T) represents the height of tree T.

Taxonomy tree of Country

Example: the distance between India and USA is 3/3 = 1; the distance between India and Iran is 2/3 ≈ 0.66.
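A sketch of the taxonomy distance in Python. The tree shape is reconstructed from the slide's example values (root Country → continents → countries), height is counted in levels with a leaf as 1, and equal values are given distance 0 by convention:

```python
children = {  # taxonomy reconstructed from the slide's examples
    "Country": ["America", "Asia", "Europe"],
    "America": ["USA", "Canada", "Brazil"],
    "Asia": ["India", "Iran"],
    "Europe": ["Greece"],
}
parent = {c: p for p, cs in children.items() for c in cs}

def height(node):
    """Height in levels: a leaf counts as 1."""
    return 1 + max(map(height, children[node])) if node in children else 1

def lca(a, b):
    """Lowest common ancestor of two taxonomy nodes."""
    ancestors = {a}
    while a in parent:
        a = parent[a]
        ancestors.add(a)
    while b not in ancestors:
        b = parent[b]
    return b

def taxonomy_distance(a, b, root="Country"):
    """H(subtree at the lowest common ancestor) / H(whole tree)."""
    return 0 if a == b else height(lca(a, b)) / height(root)

print(taxonomy_distance("India", "USA"))   # 3/3 = 1.0
print(taxonomy_distance("India", "Iran"))  # 2/3 ≈ 0.67
```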

Page 46: Anonymization Algorithms -  Microaggregation and Clustering

Distance between two records

Definition: Let QT = {N1, ..., Nm, C1, ..., Cn} be the quasi-identifier of table T, where each Ni (i = 1, ..., m) is an attribute with a numeric domain and each Cj (j = 1, ..., n) is an attribute with a categorical domain. The distance between two records r1, r2 ∈ T is defined as

Δ(r1, r2) = Σi δN(r1[Ni], r2[Ni]) + Σj δC(r1[Cj], r2[Cj])

where δN is the distance function for numeric attributes and δC is the distance function for categorical attributes.

Page 47: Anonymization Algorithms -  Microaggregation and Clustering

Distance between two records, continued

Taxonomy tree of Country

Taxonomy tree of Occupation

     Age  Country  Occupation    Salary  Diagnosis
r1   41   USA      Armed-Forces  ≥50K    Cancer
r2   57   India    Tech-support  <50K    Flu
r3   40   Canada   Teacher       <50K    Obesity
r4   38   Iran     Tech-support  ≥50K    Flu
r5   24   Brazil   Doctor        ≥50K    Cancer
r6   45   Greece   Salesman      <50K    Fever

Example

The distance between r1 and r2 is (16/33) + (3/3) + 1 = 2.485.

The distance between r1 and r3 is (1/33) + (1/3) + 1 = 1.363.

Page 48: Anonymization Algorithms -  Microaggregation and Clustering

Cost Function - Information Loss (IL)

The amount of distortion (i.e., information loss) caused by the generalization process. Note: records in each cluster are generalized to share the same quasi-identifier value, which represents every original quasi-identifier value in the cluster.

Definition: Let e = {r1, ..., rk} be a cluster (i.e., an equivalence class). The amount of information loss in e, denoted IL(e), is defined as

IL(e) = |e| · ( Σi (MAXNi - MINNi) / |Ni| + Σj H(Λ(∪Cj)) / H(TCj) )

where |e| is the number of records in e, |N| represents the size of the numeric domain N, MAXNi and MINNi are the maximum and minimum values of attribute Ni appearing in e, Λ(∪Cj) is the subtree rooted at the lowest common ancestor of every value of Cj in e, and H(T) is the height of tree T.

Page 49: Anonymization Algorithms -  Microaggregation and Clustering

Cost Function - Information Loss (IL): Example

Taxonomy tree of Country

     Age  Country  Occupation    Salary  Diagnosis
r1   41   USA      Armed-Forces  ≥50K    Cancer
r2   57   India    Tech-support  <50K    Flu
r3   40   Canada   Teacher       <50K    Obesity
r4   38   Iran     Tech-support  ≥50K    Flu
r5   24   Brazil   Doctor        ≥50K    Cancer
r6   45   Greece   Salesman      <50K    Fever

Cluster e1

Age  Country  Occupation    Salary  Diagnosis
41   USA      Armed-Forces  ≥50K    Cancer
40   Canada   Teacher       <50K    Obesity
24   Brazil   Doctor        ≥50K    Cancer

D(e1) = (41-24)/33 + 2/3 + 1 = 2.1818...
IL(e1) = 3 · D(e1) = 3 · 2.1818... = 6.5454...

Cluster e2

Age  Country  Occupation    Salary  Diagnosis
41   USA      Armed-Forces  ≥50K    Cancer
57   India    Tech-support  <50K    Flu
24   Brazil   Doctor        ≥50K    Cancer

D(e2) = (57-24)/33 + 3/3 + 1 = 3
IL(e2) = 3 · D(e2) = 3 · 3 = 9
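The IL computations for e1 and e2 can be sketched with the per-attribute terms written out; the subtree heights (2 for America, 3 for the root) come from the taxonomy trees on the earlier slides:

```python
def cluster_distortion(numeric_terms, categorical_terms):
    """D(e): sum of (max - min) / |N| over numeric attributes plus
    H(subtree at the LCA) / H(tree) over categorical attributes."""
    return sum(span / domain for span, domain in numeric_terms) + \
           sum(h / total for h, total in categorical_terms)

def information_loss(cluster_size, distortion):
    """IL(e) = |e| * D(e)."""
    return cluster_size * distortion

# e1 = {r1, r3, r5}: ages 41, 40, 24; LCA of USA/Canada/Brazil is America
d1 = cluster_distortion([(41 - 24, 33)], [(2, 3), (3, 3)])
print(round(information_loss(3, d1), 4))  # 6.5455

# e2 = {r1, r2, r5}: ages 41, 57, 24; LCA of USA/India/Brazil is the root
d2 = cluster_distortion([(57 - 24, 33)], [(3, 3), (3, 3)])
print(information_loss(3, d2))  # 9.0
```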

Page 50: Anonymization Algorithms -  Microaggregation and Clustering

Greedy k-member clustering algorithm
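The algorithm itself appears as a figure in the original slides; it can be sketched as follows, in the spirit of Byun et al.'s greedy procedure. A plain sum-of-distances cost stands in here for the paper's IL-based choice of the next record, and the starting record is fixed rather than random, both simplifying assumptions:

```python
def greedy_k_member(records, k, dist):
    """Greedy k-member clustering sketch: build clusters of exactly k
    records, each time adding the record that increases the cluster's
    cost least; leftover records join their cheapest cluster."""
    S = set(range(len(records)))
    clusters = []
    r = min(S)  # deterministic start; the paper picks a random record
    while len(S) >= k:
        # seed the next cluster with the record furthest from the previous one
        r = max(S, key=lambda i: dist(records[i], records[r]))
        c = [r]
        S.remove(r)
        while len(c) < k:
            best = min(S, key=lambda i: sum(dist(records[i], records[j]) for j in c))
            c.append(best)
            S.remove(best)
        clusters.append(c)
    for i in S:  # fewer than k left: add each to the cheapest cluster
        target = min(clusters, key=lambda c: sum(dist(records[i], records[j]) for j in c))
        target.append(i)
    return clusters

records = [1, 2, 3, 10, 11, 12]
print(greedy_k_member(records, 3, lambda a, b: abs(a - b)))  # [[5, 4, 3], [0, 1, 2]]
```

With any record-level distance function (such as the numeric-plus-categorical distance defined earlier), the same loop applies unchanged.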

Page 51: Anonymization Algorithms -  Microaggregation and Clustering

Diversity Metrics

The Equal Diversity metric (ED) assumes all sensitive attribute values are equally sensitive:

φ(e, s) = 1 if every record in e has the same s value; φ(e, s) = 0 otherwise.

The greedy algorithm is modified accordingly.

The Sensitive Diversity metric (SD) assumes there are two types of values in a sensitive attribute: truly-sensitive and not-so-sensitive:

ψ(e, s) = 1 if every record in e has the same s value and that value is truly-sensitive; ψ(e, s) = 0 otherwise.

The greedy algorithm is modified accordingly.

Page 52: Anonymization Algorithms -  Microaggregation and Clustering

Classification Metric (CM)

Preserves the correlation between quasi-identifiers and class labels (non-sensitive values):

CM = Σr Penalty(r) / N

where N is the total number of records and Penalty(r) = 1 if r is suppressed or the class label of r differs from the class label of the majority in its equivalence group.

The greedy algorithm is modified accordingly.

Page 53: Anonymization Algorithms -  Microaggregation and Clustering

Experimental Results

Experimental setup: the Adult dataset from the UC Irvine Machine Learning Repository; 10 attributes (2 numeric, 7 categorical, 1 class).

Compared with two other algorithms: median partitioning (the Mondrian algorithm) and k-nearest neighbor.

Page 54: Anonymization Algorithms -  Microaggregation and Clustering

Experimental Results

Page 55: Anonymization Algorithms -  Microaggregation and Clustering

Conclusion

The k-anonymity problem is transformed into the k-member clustering problem.

Overall, the greedy algorithm produced better results than the other algorithms, at the cost of efficiency.