Transcript
Page 1: Data Anonymization (1)

Data Anonymization (1)

Page 2: Data Anonymization (1)

Outline

Problem

Concepts

Algorithms on domain generalization hierarchy

Algorithms on numerical data

Page 3: Data Anonymization (1)

The Massachusetts Governor Privacy Breach

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge, Zip, Birth date, Sex

Voter List: Name, Address, Date Registered, Party affiliation, Date last voted, Zip, Birth date, Sex

Quasi-identifier shared by both tables: Zip, Birth date, Sex

• The Governor of MA was uniquely identified using Zip code, birth date, and sex, so the name could be linked to the diagnosis.

• 87% of the US population is uniquely identified by this quasi-identifier.

Sweeney, IJUFKS 2002


Page 4: Data Anonymization (1)

Definition

Table: columns are attributes, rows are records

Quasi-identifier (QI): a list of attributes that can potentially be used to identify individuals

K-anonymity: every combination of QI values that occurs in the table occurs in at least k records
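The k-anonymity definition above can be checked by counting how often each QI value combination occurs. A minimal sketch, assuming records are dictionaries; the function name `is_k_anonymous` and the sample rows are made up for illustration.

```python
from collections import Counter

def is_k_anonymous(records, qi, k):
    """Return True if every QI value combination occurs in at least k records."""
    counts = Counter(tuple(r[a] for a in qi) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical sample rows (not from the slides)
records = [
    {"zip": "0213*", "sex": "F", "diagnosis": "flu"},
    {"zip": "0213*", "sex": "F", "diagnosis": "asthma"},
    {"zip": "0214*", "sex": "M", "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["zip", "sex"], 2))  # False: the last group has one record
```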

Page 5: Data Anonymization (1)

Basic techniques

Generalization: Zip {02138, 02139} → 0213*

Domain generalization hierarchy: A0 → A1 → … → An. E.g. {02138, 02139} → 0213* → 021* → 02* → 0**. This hierarchy is a tree structure.

Suppression
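Walking up the Zip hierarchy amounts to dropping trailing digits. A minimal sketch, assuming the slide's notation of a single trailing `*`; the helper name `generalize_zip` is hypothetical.

```python
def generalize_zip(zip_code, level):
    """Drop the last `level` digits of a ZIP code and mark the cut with '*'.

    Level 0 returns the code unchanged. A single '*' is used regardless of
    how many digits were dropped, following the notation on the slide.
    """
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range")
    if level == 0:
        return zip_code
    return zip_code[:len(zip_code) - level] + "*"

print(generalize_zip("02138", 1))  # 0213*
print(generalize_zip("02139", 1))  # 0213*
print(generalize_zip("02138", 2))  # 021*
```

Both original codes map to the same value after one step, which is what lets them fall into the same k-anonymity group.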

Page 6: Data Anonymization (1)

Balance

Better privacy guarantee ↔ Lower data utility

There are many schemes satisfying the k-anonymity specification. We want to minimize the distortion of the table in order to maximize data utility.

• Suppression is required if we cannot find a k-anonymity group for a record.

Page 7: Data Anonymization (1)

Criteria

Minimal generalization: the minimal generalization that satisfies the k-anonymity specification

Minimal table distortion: the minimal generalization with minimal utility loss. Use precision to evaluate the loss [Sweeney papers]

Application-specific utility

Page 8: Data Anonymization (1)

Complexity of finding the optimal solution on generalization

NP-hard (Bayardo ICDE05)

So all proposed algorithms are approximate algorithms

Page 9: Data Anonymization (1)

Shared features in different solutions

Always satisfy the k-anonymity specification. If some records do not, suppress them.

Differences are in the utility loss/cost function: Sweeney's precision metric; discernibility & classification metrics; information-privacy metric

Algorithms: assume the domain generalization hierarchy is given; aim for efficiency and utility maximization

Page 10: Data Anonymization (1)

Metrics to be optimized

Two cost metrics that we want to minimize (Bayardo ICDE05)

Discernibility: based on the # of items in the k-anonymity group

Classification: the dataset has a class label column and the goal is preserving the classification model; based on the # of records in minor classes in the group
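The two cost metrics can be sketched as follows. This assumes the commonly cited form of the metrics: each record in a group of size at least k costs the group size, each record in a smaller (suppressed) group costs the table size, and the classification cost is the fraction of records that are suppressed or outside their group's majority class. The function names and sample groups are illustrative only.

```python
from collections import Counter

def discernibility(groups, k, total):
    """Sum of |group|^2 for groups of size >= k, plus total * |group| for suppressed groups."""
    return sum(len(g) ** 2 if len(g) >= k else total * len(g) for g in groups)

def classification(groups, k, total):
    """Fraction of records that are suppressed or not in their group's majority class."""
    penalty = 0
    for g in groups:
        if len(g) < k:
            penalty += len(g)                       # whole group is suppressed
        else:
            majority = Counter(g).most_common(1)[0][1]
            penalty += len(g) - majority            # records in minor classes
    return penalty / total

# Each group is the list of class labels of its records (hypothetical data)
groups = [["y", "y", "n"], ["n", "n"], ["y"]]
print(discernibility(groups, k=2, total=6))  # 9 + 4 + 6 = 19
print(classification(groups, k=2, total=6))
```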

Page 11: Data Anonymization (1)

Metrics

A combination of information loss and anonymity gain (Wang ICDE04): information loss, anonymity gain, information-privacy metric

Page 12: Data Anonymization (1)

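Metrics: information loss

The dataset has class labels

Entropy: for a set S labeled by different classes, entropy is used to calculate the impurity of the labels

$\mathrm{Info}(S) = -\sum_i p_i \log p_i$, where $p_i$ is the percentage of label $i$ in $S$

Information loss of a generalization G: $\{c_1, c_2, \ldots, c_n\} \to p$

$I(G) = \mathrm{Info}(S_p) - \sum_i \frac{N_{c_i}}{N_p}\,\mathrm{Info}(S_{c_i})$

where $S_p$ and $S_{c_i}$ are the records with value $p$ and $c_i$, and $N_p$ and $N_{c_i}$ are their counts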
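The entropy-based information loss can be computed directly from the class labels of the child values being merged. A minimal sketch, assuming base-2 logarithms (the slide only says "log") and hypothetical names `info` and `info_loss`.

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels (base 2 is an assumption)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_loss(children):
    """Info of the merged parent minus the size-weighted Info of each child.

    `children` holds one list of class labels per child value c_i that the
    generalization merges into the parent value p.
    """
    parent = [label for child in children for label in child]
    n_p = len(parent)
    return info(parent) - sum(len(c) / n_p * info(c) for c in children)

# Merging two pure children with different labels loses one full bit
print(info_loss([["y", "y"], ["n", "n"]]))  # 1.0
# Merging children with identical label mixes loses nothing
print(info_loss([["y", "n"], ["y", "n"]]))  # 0.0
```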

Page 13: Data Anonymization (1)

Anonymity gain

A(VID): # of records with the VID

AG(VID): # of records with the VID after generalization G

AG(VID) >= A(VID): generalization improves or does not change A(VID)

Anonymity gain: P(G) = x - A(VID), where x = AG(VID) if AG(VID) <= K, and x = K otherwise

As long as k-anonymity is satisfied, further generalization of the VID does not gain anything

Page 14: Data Anonymization (1)

Information-privacy combined metric

IP = info loss / anonymity gain = I(G)/P(G)

We want to minimize IP. If P(G) == 0, use I(G) only.

Either a small I(G) or a large P(G) will reduce IP. If the P(G)s are the same, pick the one with minimum I(G).
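The anonymity gain and the combined IP score can be sketched as below, including the fallback to I(G) when P(G) is zero. The function names and the candidate values are hypothetical, and the sketch assumes A(VID) <= AG(VID) as stated on the previous slide.

```python
def anonymity_gain(a_before, a_after, k):
    """P(G) = x - A(VID), with x = AG(VID) capped at k."""
    x = a_after if a_after <= k else k
    return x - a_before

def ip_score(loss, a_before, a_after, k):
    """IP = I(G) / P(G); falls back to I(G) alone when P(G) == 0."""
    gain = anonymity_gain(a_before, a_after, k)
    return loss / gain if gain != 0 else loss

# Hypothetical candidates: (name, I(G), A(VID), AG(VID))
candidates = [("G1", 0.5, 2, 4), ("G2", 0.9, 2, 10), ("G3", 0.2, 2, 2)]
k = 5
best = min(candidates, key=lambda c: ip_score(c[1], c[2], c[3], k))
print(best[0])  # G3
```

With these numbers G1 scores 0.5/2 = 0.25, G2 scores 0.9/3 = 0.3 (the gain is capped at k), and G3 scores 0.2 because its gain is zero.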

Page 15: Data Anonymization (1)

Domain-hierarchy based algorithms

Sweeney's algorithm

Bayardo's tree pruning algorithm

Wang's top-down and bottom-up algorithms

They are all dimension-by-dimension methods

Page 16: Data Anonymization (1)

Multidimensional techniques

Categorical data? Categories are mapped to numbers (numerized), as in the Bayardo 05 paper. Does the order matter? (no research on that)

Numerical data: k-anonymization becomes n-dim space partitioning. Many existing techniques can be applied.

Page 17: Data Anonymization (1)

Single-dimensional vs. multidimensional

Page 18: Data Anonymization (1)

The evolving procedure

Categorical (domain hierarchy) [Sweeney, top-down/bottom-up]

→ numerized categories, single-dimensional [Bayardo05]

→ numerized/numerical, multidimensional [Mondrian, spatial indexing, …]

Page 19: Data Anonymization (1)

Method 1: Mondrian

Numerize categorical data

Apply a top-down partitioning process

[Figure: partitioning steps: Step 1, Step 2.1, Step 2.2]

Page 20: Data Anonymization (1)

Allowable cut
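A cut is allowable when both resulting partitions still hold at least k records. The top-down process can be sketched as a recursive median split. This is a simplified illustration, not the published algorithm verbatim: it ranks dimensions by raw (unnormalized) range, and the function name and sample points are made up.

```python
def mondrian(points, k):
    """Recursively split numeric tuples into groups of size >= k.

    Assumes `points` is non-empty and k >= 1.
    """
    def spread(d):
        values = [p[d] for p in points]
        return max(values) - min(values)

    # Try the widest dimension first, then fall back to narrower ones
    for d in sorted(range(len(points[0])), key=spread, reverse=True):
        values = sorted(p[d] for p in points)
        median = values[len(values) // 2]
        left = [p for p in points if p[d] < median]
        right = [p for p in points if p[d] >= median]
        # Allowable cut: both sides keep at least k records
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [points]  # no allowable cut remains, so this is a final group

# Hypothetical (age, salary in thousands) records
points = [(25, 30), (27, 35), (40, 60), (45, 80), (52, 90), (60, 95)]
for group in mondrian(points, k=2):
    print(group)
```

Here the first cut on salary yields two groups of three; neither can be cut again without leaving a side with fewer than two records.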

Page 21: Data Anonymization (1)

Method 2: spatial indexing

Multidimensional spatial techniques

Kd-tree (similar to the Mondrian algorithm)

R-tree and its variations: R-tree, R+-tree

[Figure: tree with leaf layer and upper layer]

Page 22: Data Anonymization (1)

Compacting bounds

Example: uncompacted: age [1-80], salary [10k-100k]; compacted: age [20-40], salary [10k-50k]

The original Mondrian does not consider compacting bounds. For the R+-tree, it is done automatically.

Information is better preserved
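Compacting means publishing the tight bounding box of the records actually in a group rather than the full region of the partition. A minimal sketch with a hypothetical helper `compact_bounds` and made-up records chosen to reproduce the ranges in the example:

```python
def compact_bounds(group):
    """Tight (min, max) range per dimension over the records in the group."""
    return [(min(column), max(column)) for column in zip(*group)]

# Hypothetical (age, salary) records in one group
group = [(20, 10_000), (35, 50_000), (40, 30_000)]
print(compact_bounds(group))  # [(20, 40), (10000, 50000)]
```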

Page 23: Data Anonymization (1)

Benefits of using R+-Tree

Scalable: originally designed for indexing disk-based large data

Multi-granularity k-anonymity: layers

Better performance

Better quality

Page 24: Data Anonymization (1)

Performance

[Figure: performance plot; label: Mondrian]

Page 25: Data Anonymization (1)

Utility Metrics

Discernibility penalty

KL divergence: describes the difference between a pair of distributions, one of which is the anonymized data distribution

Certainty penalty: T: table, t: record, m: # of attributes, t.Ai: generalized range, T.Ai: total range
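The formula for the certainty penalty did not survive in the transcript. Using the symbols listed above, one commonly used form sums, over every record t and every attribute Ai, the ratio of the generalized range to the total range. The sketch below assumes that form with equal attribute weights and non-zero total ranges; the function name and the data are hypothetical.

```python
def certainty_penalty(generalized, total_ranges):
    """Sum over records t and attributes i of |t.Ai| / |T.Ai|.

    `generalized` holds one list of (low, high) ranges per record;
    `total_ranges` holds the (low, high) range of each attribute in the table.
    """
    penalty = 0.0
    for record in generalized:
        for (low, high), (t_low, t_high) in zip(record, total_ranges):
            penalty += (high - low) / (t_high - t_low)
    return penalty

# Two records published with the same generalized ranges (made-up values)
total_ranges = [(0, 100), (0, 100_000)]          # age, salary
generalized = [[(20, 40), (10_000, 50_000)]] * 2
print(certainty_penalty(generalized, total_ranges))
```

Each record contributes 0.2 for age and 0.4 for salary, so tighter generalized ranges give a lower penalty.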

Page 26: Data Anonymization (1)
Page 27: Data Anonymization (1)

Other issues

Sparse high-dimensionality: transactional data → boolean matrix. "On the anonymization of sparse high-dimensional data", ICDE08

Relates to the clustering problem of transactional data! The above one uses matrix-based clustering; item-based clustering (?)

Page 28: Data Anonymization (1)

Other issues

Effect of numerizing categorical data: the ordering of categories may have a certain impact on quality

General-purpose utility metrics vs. special task-oriented utility metrics

Attacks on the k-anonymity definition

